LNCS 3232 



Rachel Heery
Liz Lyon (Eds.) 

Research and 
Advanced Technology 
for Digital Libraries 

8th European Conference, ECDL 2004 
Bath, UK, September 2004 
Proceedings 



Springer




Lecture Notes in Computer Science 

Commenced Publication in 1973 
Founding and Former Series Editors: 

Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen 

Editorial Board 

David Hutchison 

Lancaster University, UK 
Takeo Kanade 

Carnegie Mellon University, Pittsburgh, PA, USA 
Josef Kittler 

University of Surrey, Guildford, UK 
Jon M. Kleinberg 

Cornell University, Ithaca, NY, USA 
Friedemann Mattern 

ETH Zurich, Switzerland 
John C. Mitchell 

Stanford University, CA, USA 
Moni Naor 

Weizmann Institute of Science, Rehovot, Israel 
Oscar Nierstrasz 

University of Bern, Switzerland 
C. Pandu Rangan 

Indian Institute of Technology, Madras, India 
Bernhard Steffen 

University of Dortmund, Germany 
Madhu Sudan 

Massachusetts Institute of Technology, MA, USA 
Demetri Terzopoulos 

New York University, NY, USA 
Doug Tygar 

University of California, Berkeley, CA, USA 
Moshe Y. Vardi 

Rice University, Houston, TX, USA
Gerhard Weikum 

Max-Planck Institute of Computer Science, Saarbruecken, Germany 



8th European Conference, ECDL 2004
Bath, UK, September 12-17, 2004
Proceedings




Volume Editors 



Rachel Heery 
Liz Lyon 

UKOLN, University of Bath, Bath BA2 7AY, UK
E-mail: r.heery@ukoln.ac.uk 



Library of Congress Control Number: 2004111112



CR Subject Classification (1998): H.3.7, H.2, H.3, H.4.3, H.5, J.7, J.1, I.7
ISSN 0302-9743 

ISBN 3-540-23013-0 Springer Berlin Heidelberg New York 



This work is subject to copyright. All rights are reserved, whether the whole or part of the material is 
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, 
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication 
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, 
in its current version, and permission for use must always be obtained from Springer. Violations are liable 
to prosecution under the German Copyright Law. 

Springer is a part of Springer Science+Business Media 

springeronline.com 

(c) Springer-Verlag Berlin Heidelberg 2004 
Printed in Germany 

Typesetting: Camera-ready by author, data conversion by Olgun Computergrafik 
Printed on acid-free paper SPIN: 11319139 06/3142 5 4 3 2 1 0




Preface 



We are delighted to present the ECDL 2004 Conference proceedings from the 
8th European Conference on Research and Advanced Technology for Digital Li- 
braries at the University of Bath, Bath, UK. This followed an impressive and 
geographically dispersed series of locations for previous events: Pisa (1997), Her- 
aklion (1998), Paris (1999), Lisbon (2000), Darmstadt (2001), Rome (2002), and 
Trondheim (2003). 

The conference reflected the rapidly evolving landscape of digital libraries, 
both in technology developments and in the focus of approaches to implemen- 
tation. An emphasis on the requirements of the individual user and of diverse 
and distributed user communities was apparent. In addition, the conference pro- 
gramme began to address, possibly for the first time, the associated themes of 
e-research/e-science and e-learning and their relationship to digital libraries. We 
observed increasing commonality in both the distributed information architec- 
tures and the technical standards that underpin global infrastructure develop- 
ments. Digital libraries are integral to this information landscape and to the 
creation of increasingly powerful tools and applications for resource discovery 
and knowledge extraction. Digital libraries support and facilitate the data and 
information flows within the scholarly knowledge cycle and provide essential en- 
abling functionality for both learners and researchers. The varied and innovative 
research activities presented at ECDL 2004 demonstrate the exciting potential 
of this very fast-moving field. 

The 148 papers, 43 posters, 5 panels, 14 tutorials and 4 workshops submit- 
ted this year were once again of the highest quality. They covered a very wide 
range of topics and were submitted from many countries reflecting the standing 
and profile of this major European conference. Our international Programme 
Committee of 70 expert reviewers carried out an exacting peer-review process to 
assure continued quality standards and to generate an outstanding conference 
programme. We were able to accept 47 papers, 4 of which were short papers, 
which equates to a 32% acceptance rate. In addition we had three leading ex- 
perts giving keynote presentations: Prof. Tony Hey (Director, UK E-Science 
Programme), Neil McLean (Director, IMS Australia), and Lorcan Dempsey 
(VP Research & Chief Strategist, OCLC). All information relating to the con- 
ference is located at http://www.ecdl2004.org/. 

We recognize that there is a huge effort required to organize a successful 
major international conference and thanks are due to many individuals and or- 
ganizations. In particular, we should like to extend our thanks to the Organizing 
Committee, the members of the Programme Committee, the additional referees, 
the conference Chairs, the invited speakers, panelists, all the presenters (panels, 
papers, posters, workshops and tutorials) and of course all the participants. We 
are grateful for the support of the University of Bath, Delos NoE, JISC, and 
MLA and for the helpful advice and guidance of many experts who willingly 
and freely gave their time and expertise for our collected benefit. 







Finally we would like to extend our most sincere thanks to all our colleagues 
at UKOLN, who assisted and supported ECDL 2004 from conception to conclu- 
sion. Special thanks are due to Andy Powell, Greg Tourte and Richard Waller 
for their assistance in editing these proceedings. 

July 2004 Rachel Heery 

Liz Lyon 

Abstract. This paper describes a digital library architecture and implementation
that is configurable, extensible and dynamic in the way
it presents content and in the services it provides. The design mani- 
fests itself as a network of modules that communicate in terms of XML 
messages. All modules characterize the functionality they implement in 
response to a “describe yourself” message, and can transform messages 
using XSLT to support different levels of configurability. Traditional li- 
brary values such as backwards compatibility and multiplatform opera- 
tion are combined with the ability to add new collections and services 
adaptively. The paper describes the new design and shows how it can 
be used to build four different digital library systems. We conclude by 
showing how the design fits existing interoperability frameworks. 



1 Introduction 

This paper describes a digital library design that improves upon the Green- 
stone toolkit [7]. First, it provides more flexible ways of dynamically configuring 
the run-time system and adding new services to it. Second, it lowers the over- 
head incurred by collection developers when accessing this flexibility to organize 
and present their content. Third, it modularizes the internal structure and sim- 
plifies the addition of new modules. The design is based on widely-accepted 
standards such as XML, current software practices such as simple protocols 
(like SOAP), cross-platform development strategies (Java), and contemporary 
schemes for software modularization and dynamic updates. Most important of 
all, it is informed by our experience with the current Greenstone system and the 
problems and challenges faced by real users, international collection developers, 
and practicing librarians. 

The structure of the paper is as follows. First we give some background out of 
which the requirements for the digital library software arose. Next we describe 
the new design, called Greenstone3, and discuss how it meets the identified 
needs. Fundamental to the approach is the use of XML throughout for data 
representation, combined with XSL Transforms to provide a flexible mechanism 
for adjusting the functionality of the runtime system without having to modify 
and recompile the source code. To promote cross-platform independence (which
has consumed inordinate human resources in the present Greenstone system) and
facilitate the dynamic loading of services and other modules, the implementation
uses Java.

R. Heery and L. Lyon (Eds.): ECDL 2004, LNCS 3232, pp. 1-13, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

David Bainbridge et al.

Following this we describe four very different examples built using the new 
design. The first demonstrates backwards compatibility, and, in addition, op- 
erates within a distributed environment. The second augments text at display 
time, using an established software tool for text mining developed elsewhere, to 
show how functionality can be enriched through the introduction of a new ser- 
vice. The third is a map-based digital library that introduces two new services to 
support geographic functionality and, while still accessed through a web-based 
interface, is highly tailored to the specific domain of its source documents. The 
fourth is a radically different interface for the interactive viewing of search results 
as a hierarchically organized cluster of documents, and illustrates the versatility 
of the design in coping with fine-grained interaction. We conclude with a com- 
mentary on how the design fits existing interoperability frameworks, from which 
many valuable lessons have been learned. 

1.1 Background 

Over the years the Greenstone digital library software has been employed by 
many users internationally to develop a wide variety of digital libraries. In ad- 
dition to operation over the Web, an early application, and still a major one, 
is collections of humanitarian information in the form of CD-ROMs that run 
on any Windows computer (including Windows 3.1). There are now over two 
dozen of these collections, and they have been distributed widely by the United 
Nations and other non-governmental agencies [9].

Many other styles of collection have been developed under Greenstone. They 
range from numerous personal collections based around common document for- 
mats such as E-mail, photographs, Word, and PDF documents through to large- 
scale bibliographic catalogs such as the BBC radio and TV archives. There are 
mixed media collections involving text, images and audio drawn from data such 
as historic newspapers and oral history. Several demonstration music collections 
support direct content-based retrieval through “query by humming.” Many customized
and branded user interfaces exist, such as that of the New York Botanical Gardens,
as do many international collections in local languages created by institutions
in China, India, Croatia, Russia and Israel.

The software is distributed in source form, and for convenience binaries are 
available for Windows, Linux and MacOS X. The user interface is available in 
30 languages. A recent addition is a graphical interface for collection design 
and construction [2] which is also multilingual. A portfolio of demonstration 
collections is available at www.nzdl.org, and selected example collections built 
internationally can be found at www.greenstone.org. 

1.2 Weaknesses 

Many experimental interfaces have been built for Greenstone, some of which 
make use of a CORBA-based protocol to support distributed client-server interaction.
These include a Venn diagram tool for formulating Boolean queries
graphically, and a bibliographic visualization tool that plots matching citations 
on an x-y grid based on publication year and ranked relevance score to the 
query terms [1]. However, while perfectly functional, some of these variants are 
implemented inelegantly, exposing limitations caused by certain aspects of the 
design. For instance, the immutable nature of the index files generated during 
the building process makes incremental adjustments to a collection expensive. 
Another hindrance is the low level of functional customization supported by the 
runtime system. Although minor presentation tweaks are easy, more extensive 
changes involve modifying and recompiling the source code. The strongly typed 
nature of the CORBA protocol was also found to be overly restrictive in many 
practical situations. 

1.3 Requirements 

From these and other considerations arose the following requirements for an 
improved design. 

Backwards compatibility. Naturally we wish to retain the existing system’s 
strengths. This is accomplished by ensuring that the new design is backward 
compatible, which has the added benefit of providing existing developers and 
users with an easy migration path. 

Levels of customization. To match the different categories of people involved 
in constructing digital libraries, different levels of customization are required. 
For instance, a content developer may wish to include source documents in 
a new format; a collection editor may seek influence over diverse issues of 
presentation. 

Software modularity. To facilitate development and long-term management 
of the software, code modularization - a mantra of any software engineering 
approach - is essential. This is promoted by adopting off-the-shelf technology 
such as a database system, indexing tools, and page rendering software; and 
by the use of standards. 

Service based. Basing a digital library around a set of services is another way
to accomplish modularity - in this case, modularity of function.

Distributed architecture. A rich digital library infrastructure can only be
supported by a distributed architecture, and the addition of an open protocol
helps to foster interoperability.

Future compatibility. Libraries are long-term institutions with a mandate for 
preservation, and it is essential that old collections can be presented by 
future versions of the system. This is a more ambitious requirement than 
mere extensibility which, although an admirable quality in any design, does 
not necessarily ensure that future versions can safely interact with current 
ones. 

Dynamic. Many aspects of the library should be dynamic, for example content, 
whereby documents and metadata can be added, revised and removed while 
a repository remains on-line, and configuration, where presentational issues 
can be adjusted and services added at runtime. 

Integrated documentation. Large-scale software systems such as digital libraries
benefit immensely from the use of an integrated documentation system.

Self-describing modules. This goes a stage further than the previous item: 
modules describe themselves in a machine-readable format so that other 
modules can interact with them without the need for explicit control. 

Computer environment integration. A digital library should mesh well into 
a user’s existing computer environment. Full integration makes a digital 
library become a seamless component of each user’s work environment. 



2 Software Design 

The new software design has evolved from the requirements articulated above, 
our own experience of developing digital library systems, and studies of other 
open source software and research projects. This section provides an overview of 
modularity and inter-module communication, and then works through a simple, 
stand-alone example. The next section describes four very different examples 
built using the new design, to illustrate the general applicability of the approach. 

2.1 Modularity 

We decided that the best way to meet the challenges posed by the list of require- 
ments was to reimplement Greenstone using a modular topology. In the design, 
a digital library manifests itself as a network of modules, and communication 
between them is expressed through an instantiation of XML. A mandatory re- 
quirement for a module - enforced through its base class definition - is that it 
handles “describe yourself” messages. What kinds of messages follow typically 
depend on the outcome of an initial “describe yourself”. Modules have the ability
to transform messages by applying an XSL Transform. This is a particularly
useful mechanism for dynamically controlling levels of configurability - one of 
our requirements. Within the network of modules comprising a digital library, 
a set of services are defined that prescribe the functionality supported by that 
particular digital library configuration. 
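
The mandatory “describe yourself” behaviour can be illustrated with a minimal sketch. The class and method names below (Module, process, describe, handle) are hypothetical and are not taken from the Greenstone3 source, which is written in Java. The XSLT transformation step is left out because Python's standard library has no XSLT processor.

```python
import xml.etree.ElementTree as ET


class Module:
    """Hypothetical base class: every module must answer 'describe' requests."""

    def __init__(self, path):
        self.path = path  # address of the module, e.g. "demo/TextQuery"

    def describe(self, response, lang):
        """Subclasses add their own description to the response element."""
        raise NotImplementedError

    def handle(self, request, response):
        # Requests other than 'describe' are module-specific; reject by default.
        ET.SubElement(response, "error").text = "unsupported request type"

    def process(self, request_xml):
        request = ET.fromstring(request_xml)
        lang = request.get("lang", "en")
        response = ET.Element("response", {
            "from": self.path,
            "lang": lang,
            "type": request.get("type", ""),
        })
        if request.get("type") == "describe":
            self.describe(response, lang)
        else:
            self.handle(request, response)
        return ET.tostring(response, encoding="unicode")


class TextQuery(Module):
    def describe(self, response, lang):
        service = ET.SubElement(response, "service", name="TextQuery", type="query")
        params = ET.SubElement(service, "paramList")
        ET.SubElement(params, "param", name="query", type="string")


module = TextQuery("demo/TextQuery")
print(module.process("<request to='demo/TextQuery' lang='en' type='describe' uid='21'/>"))
```

Putting the “describe” branch in the base class means a subclass cannot forget to support it; it can only fail to fill in the details.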

2.2 Communication 

Modules communicate using synchronous request-response pairs. Fig. 1 shows
an example XML exchange, where the TextQuery module in a collection called
“demo” is asked to describe itself. It does this, providing information about the
structure of the service as well as fragments of text to be used for display. These
are returned in the language specified in the request. While the language used
is structured and typed, it has been crafted in such a way that the information
it contains can be open-ended. For example, in Fig. 1 the paramList structure
allows other items to be included, such as whether stemming is on or off. In addition
to supporting optional parameters, extensions can be introduced without
having to update the protocol’s API, thereby avoiding the need to propagate
code changes to the entire distributed network (which may require taking it
off-line while the changes are made). For a more thorough technical description
of the design, see the integrated documentation that is supplied with the
Greenstone3 source code (available through SourceForge).

    <request to='demo/TextQuery' lang='en' type='describe' uid='21'/>

    <response from='demo/TextQuery' lang='en' type='describe'>
      <service name='TextQuery' type='query'>
        <displayItem name='name'>Search</displayItem>
        <displayItem name='description'>Full text search</displayItem>
        <displayItem name='submit'>Search</displayItem>
        <paramList>
          <param name='index' type='enum_single' default='stx'>
            <displayItem name="name">Index to search in</displayItem>
            <option name="dtx">
              <displayItem name="name">entire documents</displayItem>
            </option>
            <option name="stx">
              <displayItem name="name">chapters</displayItem>
            </option>
          </param>
          <param name='maxDocs' type='integer' default='10'>
            <displayItem name="name">Maximum hits to return</displayItem>
          </param>
          <param name="query" type="string">
            <displayItem name="name">Query string</displayItem>
          </param>
        </paramList>
      </service>
    </response>

Fig. 1. A sample XML exchange as a request/response pair.
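
To illustrate why an open-ended paramList avoids API changes, the following sketch reads a describe response in the style of Fig. 1 and lists whatever parameters it finds. The function name read_params and the extra “stem” parameter are assumptions made for this example; the message is a cut-down version of the figure.

```python
import xml.etree.ElementTree as ET

# A cut-down describe response in the style of Fig. 1. The 'stem' parameter
# is a hypothetical extension, added only to show that the client code
# below does not need to change when a service gains a parameter.
DESCRIBE_RESPONSE = """
<response from='demo/TextQuery' lang='en' type='describe'>
  <service name='TextQuery' type='query'>
    <displayItem name='name'>Search</displayItem>
    <paramList>
      <param name='index' type='enum_single' default='stx'>
        <displayItem name='name'>Index to search in</displayItem>
        <option name='dtx'><displayItem name='name'>entire documents</displayItem></option>
        <option name='stx'><displayItem name='name'>chapters</displayItem></option>
      </param>
      <param name='maxDocs' type='integer' default='10'>
        <displayItem name='name'>Maximum hits to return</displayItem>
      </param>
      <param name='query' type='string'>
        <displayItem name='name'>Query string</displayItem>
      </param>
      <param name='stem' type='boolean' default='0'/>
    </paramList>
  </service>
</response>
"""


def read_params(response_xml):
    """Return one dictionary per <param>, in document order."""
    root = ET.fromstring(response_xml.strip())
    params = []
    for param in root.iter("param"):
        # Only the param's own label, not the labels of its options.
        label = param.find("displayItem[@name='name']")
        params.append({
            "name": param.get("name"),
            "type": param.get("type"),
            "default": param.get("default"),
            "label": label.text if label is not None else None,
            "options": [opt.get("name") for opt in param.findall("option")],
        })
    return params


for p in read_params(DESCRIBE_RESPONSE):
    print(p["name"], p["type"], p["default"], p["options"])
```

A front end written this way can build a search form from the response alone, so a newly introduced parameter appears in the interface without any client code being redeployed.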

2.3 A Simple Stand-Alone Example 

The simple stand-alone example shown in Fig. 2 encapsulates many of the de- 
sign’s key features. It comprises a “back end” server, termed a digital library 
site in our design, coupled to a “front end” that provides the user interface. The 
modules’ names were chosen to emphasize the roles they play. The Reception- 
ist’s point of contact with the server is the MessageRouter (MR) module - all 
communication with the site occurs through this module. The configuration is 
designed to generalize to a distributed environment. 

The digital library back end in Fig. 2 contains two collections, demo and fao, 
and a cluster of collection-formation services. AddDocument is a service that 
adds a document to a collection. ImportCollection imports into the system all 
documents associated with a collection, converting them where necessary from 
their original form. BuildCollection builds all indexes and browsing structures 
that are associated with a collection. ActivateCollection makes a newly-built 
collection active, so that it can be seen by digital library users. These services 
are all concerned with creating a digital library collection. Being related, they 
are grouped together into a “service cluster.” In Fig. 2 the services just listed 
are accessed through the CollectionFormation service cluster module. 

As far as the digital library user is concerned, a “collection” is a focused group
of documents with a uniform means of access. For the system, it is a service cluster
that groups a set of services that are related by the data they work on. For
example, the demo collection towards the lower left of Fig. 2 contains three modules
providing four services to the user. The modules are called “Search,” which
provides a TextQuery service, “Retrieve,” which provides a document retrieval
service ResourceRetrieve and a metadata retrieval service MetadataRetrieve, and
“Browse,” which provides a metadata browsing service ClassifierBrowse.

Fig. 2. A simple stand-alone site.
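
The routing role of the MessageRouter, with services addressed by collection and service name as in “demo/TextQuery”, might be sketched as follows. This is an illustration under assumptions, not the Greenstone3 implementation: the register method, the error response format and the stand-in text_query handler are all hypothetical.

```python
import xml.etree.ElementTree as ET


class MessageRouter:
    """Hypothetical router: forwards each request to the module named in 'to'."""

    def __init__(self):
        self.modules = {}  # address such as "demo/TextQuery" -> handler function

    def register(self, address, handler):
        self.modules[address] = handler

    def process(self, request_xml):
        request = ET.fromstring(request_xml)
        address = request.get("to", "")
        handler = self.modules.get(address)
        if handler is None:
            response = ET.Element("response", {"from": address, "type": "error"})
            response.text = "unknown module: " + address
        else:
            response = handler(request)
        return ET.tostring(response, encoding="unicode")


def text_query(request):
    # Stand-in for a real search service; it only echoes the query string.
    response = ET.Element("response", {"from": request.get("to"), "type": "process"})
    ET.SubElement(response, "query").text = request.findtext("query", default="")
    return response


router = MessageRouter()
router.register("demo/TextQuery", text_query)
print(router.process(
    "<request to='demo/TextQuery' type='process'><query>snail</query></request>"))
```

Because the router sees only strings of XML going in and out, the same interface could sit behind a network transport without the caller noticing.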

The Web-based front end at the upper left of Fig. 2 centers around the 
Receptionist, which is the point of contact for the interface generator. A servlet 
(labeled “Library Servlet”) takes HTTP commands in the form of URLs and 
arguments and translates them into XML for the Receptionist. The Receptionist 
is capable of executing various different actions, each of which involves (usually) 
many calls to the digital library’s MessageRouter (center of Fig. 2). The built- 
in ability for Receptionist modules to transform XML messages using XSLT is 
used - in conjunction with a style sheet - to generate the HTML that is finally 
presented to the user. 
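
The servlet's translation step, from URL arguments to an XML request, can be sketched as below. The argument names c, s and l, the “process” request type and the param element layout are assumptions chosen for illustration; the paper does not specify them.

```python
from urllib.parse import parse_qsl, urlsplit
import xml.etree.ElementTree as ET


def url_to_request(url):
    """Translate URL arguments into an XML request string.

    Assumed conventions (not from the paper): 'c' names the collection,
    's' the service and 'l' the language; every other argument becomes
    a <param> element inside a <paramList>.
    """
    args = dict(parse_qsl(urlsplit(url).query))
    collection = args.pop("c", "")
    service = args.pop("s", "")
    lang = args.pop("l", "en")
    request = ET.Element("request", {
        "to": collection + "/" + service,
        "lang": lang,
        "type": "process",
    })
    param_list = ET.SubElement(request, "paramList")
    for name, value in args.items():
        ET.SubElement(param_list, "param", name=name, value=value)
    return ET.tostring(request, encoding="unicode")


print(url_to_request(
    "http://localhost/library?c=demo&s=TextQuery&l=en&query=snail&maxDocs=10"))
```

Keeping this translation in one place means the Receptionist and everything behind it deal only with XML, whatever form the user's request originally took.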



3 Examples 

For pedagogical purposes, the example configuration in Fig. 2 is rudimentary. 
The front and back ends (receptionist and site server) are compiled together into 
a single executable process with a single MessageRouter handling all communi- 
cation between them. However, the design supports a far richer infrastructure 
than this. MessageRouters have the capability to communicate across a dis- 
tributed network with other MessageRouters and with Receptionists. Different 
implementations of the same service can be switched in and out to give a digital 
library site different algorithmic behavior (such as incremental building); and
new services can be introduced and brought on-line, dynamically if necessary. 
The library can also have many different user interfaces. 

We now describe four more interesting scenarios. The first emulates “classic” 
Greenstone to demonstrate backwards compatibility, with a distributed config- 
uration a straightforward addition. The second augments text at display time, 
to show how functionality can be enriched through the introduction of a new 
service. The third and fourth utilize novel interfaces to enrich the interaction: 
the former is specialized for geographic maps, and the latter for text-based
collections.

3.1 Classic Greenstone, with Collections on Remote Sites 

To provide backwards compatibility, a set of Greenstone3 services have been
implemented that access index and database files generated by a Greenstone2
system. Indeed we have already seen these in Fig. 2. Moreover, using Cascading
Style Sheets (CSS), the look and feel of the traditional interface has been
matched one-for-one, so that the user’s experience of the digital library remains
the same. Of course with the more versatile design, more sophisticated internal
structures can be built within the digital library. For instance, because of the
design decision that all communication be XML based, it is straightforward to
arrange for the XML messages sent between modules to be seamlessly streamed
across remote machines. For this we employ the Simple Object Access Protocol
(SOAP), but any protocol that can process text can be used. To the user there is
no discernible difference in accessing collections this way. Backwards compatibility
is not the sole aim, however, and the structure allows increasingly interesting
digital library designs - ones that do provide a distinctly different experience for
the user. We discuss three such examples next.

3.2 Dynamic Text Mining 

The next example shows how a new Greenstone3 service can be provided that
enhances documents for the reader, based upon the open source GATE system for
text mining [5]. The motivation is that highlighting selected items in documents
can make it easier for users to scan them for particular pieces of information.
In this implementation, digital library documents, once located, are mined on
the fly, marking up entities such as keywords and place names before presenting
them to the user [8].

Fig. 3(a) shows a section of a book on Butterfly Farming in Papua New
Guinea (from the Humanity Development Library at www.nzdl.org) that the
user is reading. The Annotate document menu at the top right calls the display-time
text mining feature. When items on this menu are selected, text of the
selected type is highlighted in the document displayed. In this case the user
has selected Places and Organizations, and these are highlighted wherever they
occur, in different colors (simulated by white-on-grey and black-on-grey in the
illustration).

Fig. 3. Additional services to augment run-time functionality: (a) text augmentation of
place names; (b) a novel map-based interface.

The selectable items in Fig. 3(a) were identified and extracted by ANNIE,
GATE’s information extraction system. Whenever an annotation type is selected
from the menu, Greenstone3 calls the information extraction module dynamically
to identify items of that type. This incurs a delay that depends on the
document’s size but is usually short: the system processes text at the average
rate of 15 Kb/sec on a typical workstation. To achieve this, the software
is loaded when the Greenstone3 server is started and remains resident in memory
thereafter, to avoid a start-up delay (of about 10 seconds) in which all the
text processors are initialized and prepared for use.

In Greenstone3, GATE is encapsulated in a module as just another service. 
It takes messages with two parameters, the type of annotation and a document, 
and returns the document with relevant items marked up with XML tags. A style 
sheet in the server module chooses the colors in which the items are displayed. 

In order to generate the menu in the figure, a “describe yourself” message 
is sent to the GATE module, which returns a list of the annotations that are 
available. Apart from this one module, the rest of the digital library system 
knows nothing of GATE. However, it does know about a general class of ser- 
vices, Enrich, that take a document and return the same document with some 
elements marked up. To add this text-mining ability to an existing Greenstone3
site dynamically, the Java class files for the new service need to be added to the
correct folder on the DL site’s file system, the site’s configuration file updated
to name the new service, and a “reconfigure” signal sent (which can be initiated
through a web browser) to the DL site server.
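
The contract of an Enrich-style service (annotation type and document in, marked-up document out, plus a description listing the available annotation types) can be sketched with a toy stand-in. This is not GATE or ANNIE: the lookup lists, the “annotation” tag name and the function names are invented for illustration, and the second place name and the organization are made-up sample data.

```python
import re
from xml.sax.saxutils import escape

# Toy lookup lists standing in for a real information extraction engine.
# Apart from "Papua New Guinea", the entries are made-up sample data.
ANNOTATORS = {
    "Places": ["Papua New Guinea", "Example Valley"],
    "Organizations": ["Example Farming Agency"],
}


def describe():
    """List the annotation types on offer, as a 'describe yourself' reply would."""
    return sorted(ANNOTATORS)


def enrich(annotation_type, text):
    """Return the text with matching items wrapped in XML tags."""
    terms = ANNOTATORS.get(annotation_type, [])
    if not terms:
        return escape(text)
    # Longest terms first, so a longer name wins over a shorter one it contains.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    out, last = [], 0
    for match in pattern.finditer(text):
        out.append(escape(text[last:match.start()]))
        out.append('<annotation type="%s">%s</annotation>'
                   % (annotation_type, escape(match.group(0))))
        last = match.end()
    out.append(escape(text[last:]))
    return "".join(out)


print(describe())
print(enrich("Places", "Butterfly farming in Papua New Guinea"))
```

Since the rest of the system relies only on this contract, a different extraction engine could be swapped in behind the same two functions.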



3.3 Geographical Map Based Interface 

Fig. 3(b) is taken from a site configured to support map searching based on
a small collection of New Zealand maps drawn between 1770 and 1953, part
of our university library’s special collections. It shows the result of searching
for “Rangiriri” - a township in the North Island of New Zealand and, on a hill
nearby of the same name, the scene of a battle in 1863 between Maori and British
soldiers. Two crosses (colored red when viewed online) have been overlaid on the
map showing the position of the township and hill, and can be seen slightly
above the center of the map.

Fig. 4. Applet-based Dendro visualization of clustered documents matching query.

The query interface and presentation of the results are similar to the traditional
interface, except that against each matching item a thumbnail of the map
containing the place name is given. Fig. 3(b) is the result of homing in on one
such map appearing in the result set. Through the interface the user can change
the level of magnification of the map and request that additional features such
as cemeteries, bridges, and estuaries be marked.

This new functionality is provided by adding two new services, by including
geographic metadata with the collection’s source documents (the maps), and by
adding a gazetteer of over 40,000 entries from Land Information New Zealand (LINZ).
Searching is handled by the MapQuery service. For each place name query term
that is located in the gazetteer, its longitude and latitude are checked against
each map to see if it falls within its boundaries. Once all cross-referencing is
complete, the list of maps with matching entries is sorted by the number of query
terms on each map. Image manipulation and the results display are handled
by the MapRetrieve service, which makes use of the open source ImageMagick
toolkit for copying, annotating (such as placing a cross at an x, y location) and
resizing images.
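
The matching and ranking performed by MapQuery can be sketched as follows, assuming each map carries a simple longitude/latitude bounding box. The MapSheet class, the sheet names and the coordinates are illustrative values invented for this example, not LINZ data or the actual service code.

```python
from dataclasses import dataclass


@dataclass
class MapSheet:
    name: str
    west: float
    south: float
    east: float
    north: float

    def contains(self, lon, lat):
        return self.west <= lon <= self.east and self.south <= lat <= self.north


def map_query(terms, gazetteer, maps):
    """Rank maps by how many distinct query terms fall inside their bounds."""
    hits = {}  # map name -> set of matching query terms
    for term in terms:
        key = term.lower()
        # A place name may have several gazetteer entries (township, hill, ...).
        for lon, lat in gazetteer.get(key, []):
            for sheet in maps:
                if sheet.contains(lon, lat):
                    hits.setdefault(sheet.name, set()).add(key)
    # Most matching terms first; ties broken by name for a stable order.
    ranked = sorted(hits.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(name, len(found)) for name, found in ranked]


# Rough, made-up coordinates for illustration only.
GAZETTEER = {
    "rangiriri": [(175.13, -37.43), (175.14, -37.42)],
    "mercer": [(175.05, -37.28)],
}
MAPS = [
    MapSheet("sheet-A", 174.9, -37.6, 175.4, -37.2),
    MapSheet("sheet-B", 175.1, -37.5, 175.2, -37.4),
    MapSheet("sheet-C", 174.5, -37.1, 175.0, -36.7),
]

print(map_query(["Rangiriri", "Mercer"], GAZETTEER, MAPS))
```

Collecting matches in a set per map means a place with two gazetteer entries on the same sheet still counts as one query term when ranking.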



3.4 Cluster Visualizations 

Fig. 3 shows a web-based interface, but there may be other forms, ranging from 
standalone applications and applets that display documents in different ways 
to alert services that recognize when new information becomes available in one
of the collections and formulate appropriate E-mail to users. Fig. 4 is one such
instance: a Java applet that supports browsing result sets as document clusters
through a multiviewed, interactive interface [3]. To function it needs the addition
of a new build-time service, VisConstruct, which is responsible for computing
additional statistics about the collection necessary to perform subsequent document
clustering operations, and a sub-classing of the Applet service, tailored
for the HTML necessary to embed the visualization applet within a web page.
Besides this, the visualization can interact with any text-based Greenstone col- 
lection that defines keyword metadata. The distinctly different interface behav- 
ior is achieved through fine-grained interaction with the existing TextQuery and 
MetadataRetrieve services using SOAP. 

Derived from the query term entered, there are four views in the visualization. 
The first is the standard result list format; the other three are based around a 
shared model of hierarchically clustered documents. In Fig. 4 the user has entered
the query term “recession” to a collection based on the Wall Street Journal and
is viewing the result as a Dendro map - the circularly arranged tree displayed in
the main panel - which contains words like slowing, slowdown and inflationary
around the circumference.

The size of nodes in the tree represents the number of documents clustered 
at that point in the hierarchy, and for reasons of space only 5 levels of the tree 
are shown at a time. Clicking on a node shows a popup menu with the precise 
number of documents in the cluster and the option of “drilling down” as shown in 
the figure. The consequence of choosing this option is for the tree to be redrawn 
with the selected node at the center of the map, thereby exposing more of the 
detail in that part of the hierarchy. While all this is going on, the left panel
displays the keywords relevant to the selected cluster, and the bottom panel 
lists the documents involved. Within the left panel, keywords can be activated 
or deactivated and the bottom panel dynamically filtered to reflect the changes. 



4 Interoperability Frameworks 

This modular, service-based and dynamically configurable approach to digital 
library architecture can be found in other designs and systems. Two key exam- 
ples are the Extended Open Archives Initiative protocol (XOAI) proposed by 
Suleman and Fox [6] and the OpenDLib system of Castelli and Pagano [4].

The XOAI protocol, based upon XML and related technologies, builds upon 
the basic OAI Protocol for Metadata Harvesting. It is a modular protocol which 
represents each service as a different type of OAI request. This service-based 
approach is readily seen in Greenstone3’s protocol, where each service has its own 
request and response format. Like OAI and XOAI, Greenstone3 uses XML for 
representing requests and responses; providing a gateway between Greenstone3 
and XOAI servers and clients should be readily achievable. 

The OpenDLib system also takes a modular approach to service communica- 
tion and library construction. Many of its features are similar to those of XOAI 
and Greenstone3, and the handling of requests to the library map as readily as
for XOAI. However, OpenDLib is centered upon a “manager” service that coor- 
dinates services, responding to the introduction of new services and alterations 
to existing ones. Coordination is achieved through a communication protocol 
between the manager and each service. Individual services, including each li- 
brary in a federation, acquire data from the manager. The manager does not 
itself control or direct requests for searches, documents, and so on but instead 
coordinates between services to ensure consistency. Greenstone3 provides many 
of these features - for instance, service description requests - without requiring 
a centralized manager. 

The similarity of protocols between Greenstone3, XOAI and OpenDLib 
echoes experience with the previous generation of DL protocols and augurs well
for interoperability. Greenstone’s widespread adoption and free availability al- 
lows others to adopt it as a platform for experimenting with the service-centered 
design approach. It facilitates the addition of new digital library features which 
would have been hampered by the less open structures of Greenstone2 and its 
contemporaries. Greenstone3 retains and extends the lightweight configurability 
of its predecessor without requiring centralized management processes. 



5 Conclusions 

This paper has presented a new digital library design which is significantly dif- 
ferent from the present Greenstone, but strongly rooted in past success. To 
meet a broad range of requirements a modular based topology is utilized to pro- 
vide a flexible infrastructure. Written in Java to promote portability, dynamic 
loading of objects and internationalization, modules communicate by streaming 
XML messages between each other. Using SOAP this communication can be dis- 
tributed across a network. All modules have the ability to describe themselves 
in a machine readable form, and to apply an XSLT to transform messages. The 
latter is instrumental in providing different levels of configurability, an important 
ability given the different types of people involved in the life-cycle of a digital 
library. 

Several working examples have been given to convey the general applicabil- 
ity of the design. The first example demonstrated backwards compatibility to 
collections built with Greenstone2, and satisfies another of the identified require- 
ments: to help minimize the migration path of existing developers and users. It 
also demonstrated the design functioning in a distributed environment. The sec- 
ond example augmented an existing DL site with text mining capabilities with 
the introduction of a new service. The third and fourth examples illustrated 
substantially different user interfaces: one specialized for maps that required 
two extra services to provide the necessary geographical functionality; the other 
provided an interactive visualization to search results through a Java applet that 
was applicable to a wider realm of text-based collections. 

Numerous aspects of the design contribute to its dynamic configurability. The 
“describe yourself” feature allows receptionists to adapt to the facilities that a 
server presents. Through XSLT a corpus editor can exert significant control over
the function and appearance of a digital library such as modifying the structure 
of a generated page, for example adding a transform that sorts the query results 
alphabetically by title rather than ranked score. Alternatively they may reformat 
the results according to OAI syntax in preparation for exporting. Previously 
such changes required editing the source code and restarting the digital library 
software after recompilation. 

Implementing the design in Java has promoted dynamic attributes in the system,
particularly through the ease with which modules can be loaded at runtime.
The re-reading of configuration files is also straightforward to arrange and adds 
dynamic abilities that a digital library administrator can take advantage of. In 
addition to integrating the documentation with the software using JavaDoc and 
other techniques, usability of the software also benefits from the development 
of self-documenting collections that both demonstrate and describe particular 
aspects of design. The list of requirements is ambitious, however, and not all of 
them have been proven yet. For example, while none of the sample implemen- 
tations demonstrate computer environment integration, the design is capable of 
supporting the idea through specialized receptionists. For future compatibility, 
more time is needed to establish whether the design successfully meets such a criterion.

Abstract. This paper describes the MILOS Multimedia Content Management
System: a general purpose software component tailored to support
design and effective implementation of digital library applications.
MILOS supports the storage and content based retrieval of any multime- 
dia documents whose descriptions are provided by using arbitrary meta- 
data models represented in XML. MILOS is flexible in the management 
of documents containing different types of data and content descriptions; 
it is efficient and scalable in the storage and content based retrieval 
of these documents. The paper illustrates the solutions adopted to sup- 
port the management of different metadata descriptions of multimedia 
documents in the same repository, and it illustrates the experiments per- 
formed by using the MILOS system to archive documents belonging to 
four different and heterogeneous collections which contain news agencies,
scientific papers, and audio/video documentaries. 



1 Introduction 

Digital Library (DL) technology is today limited to managing specific types of digital
objects and specific metadata description models. This implies that existing
DL applications can hardly be adapted to different application environments and
to different metadata description models. Indeed, many DLs were built having 
in mind a specific application and, in many cases, a specific document collection, 
thus resulting in an ad-hoc solution: all components of the DL - the data repos- 
itory, the metadata manager, the search and retrieval components, etc. - are 
specific to a given application and cannot be easily used in other environments. 
Many of these systems guarantee inter-operability with other systems, by adopt- 
ing standard protocols such as OAI, or Z39.50. However, their inter-operability 
is limited to the exchange (import/export) of data/metadata. In fact, there is no 
chance of reusing software components, to integrate functionality of other DLs, 
or to use digital contents (documents and metadata) compliant with other standards.
This is mainly due to the lack of standard general purpose basic building
components tailored to DL application design. 

In this paper we propose an approach similar to that applied in the field 
of traditional database applications. In fact, database applications are generally 
built relying on a Database Management System (DBMS), a general purpose 
software module that offers all functions needed to build many different database 
applications (e.g., banking, corporate management, billing, etc.); these applica- 
tions will use different types of data, and they will support many different types 
of retrieval. We intend to demonstrate that the same can be done in the DL 
field: it is possible to build a general purpose Multimedia Content Management 
System (MCMS) which offers functionalities specialized for DL applications. Different
DL applications can be built on top of such an MCMS, each supporting
the management of documents of any data type, described by using different 
metadata description models, searchable in many different modes. This MCMS 
should be able to manage not only formatted data, like in databases, but also 
textual data, using Information Retrieval technology, semi-structured data, typ- 
ically in XML, mixed-mode data, like structured presentations, and multimedia 
data, like images and audio/video. 

In this paper we discuss the functionality that the MCMS should provide 
(Section 2) and we present the MILOS Multimedia Content Management System 
(Section 3), that we built according to those criteria. Finally we present several 
significant DL applications that were implemented by using MILOS, and we 
show the advantages of the proposed approach in building these specific DL 
applications, resulting in the simplicity of the implementation and in significant 
system performance (Section 4). 

2 Motivations 

Digital library applications are document intensive applications where possibly 
heterogeneous documents and their metadata have to be managed efficiently and 
effectively. We believe that the main functionalities required by DL applications 
can be embedded in a general purpose Multimedia Content Management System 
(MCMS), that is a software tool specialized to support applications where doc- 
uments, embodied in different digital media, and their metadata are efficiently 
and effectively handled. 

The minimal requirements of a Multimedia Content Management System are
flexibility, in structuring both multimedia documents and their metadata, scalability,
and efficiency. Flexibility is required to manage both basic multimedia
documents and their metadata. The flexibility required in representing and ac- 
cessing metadata can be obtained by adopting XML as standard for specifying 
any metadata (for example MPEG-7 [5] can be used for multimedia objects, or 
SCORM [11] for e-Learning objects). Requirements of scalability and efficiency 
are essential for the deployment of real systems able to satisfy the operational 
requirements of a large community of users over a huge amount of multimedia 
information. 

An MCMS mainly supports the storage and preservation of digital documents,
and their efficient and effective retrieval and management. This is provided with 
an appropriate management of documents and related metadata, by: 

1. managing different documents embodied in different media and stored with 
different strategies; 

2. supporting the description of document content by way of arbitrary, and 
possibly heterogeneous, metadata; 

3. providing DL applications with custom/personalised views on the metadata 
schema actually handled. 

Point 1) requires that no assumption should be made on the types of media
and encoding used to represent documents, and especially on the specific strat- 
egy used to store them. This allows applications to be unaware of the technical 
details related to multimedia document management. For instance, textual doc- 
uments can be stored in the file system and served to the users using a normal 
web server. However, video documents might need to be maintained in a video 
server that uses various storage devices, such as digital tapes stored in
silos, optical disks, and/or temporary storage space on arrays of hard disks [21]. 
In addition, video documents might be served exploiting specific real-time con- 
tinuous media streaming strategies to avoid hiccups during playback. The DL 
application should be designed independently from these issues, which should 
be managed transparently by the MCMS. For instance, changes in the storage 
strategies should be possible without changing the DL application software. 

Point 2) states that a content management system should be able to deal with 
arbitrary metadata. This is required by the fact that different DL applications, 
according to their specific requirements, might need to use different metadata. 
Consider that existing archiving organizations already have their own metadata
schemas, and are reluctant to modify them to be compatible with a specific
system. Therefore, a DL management system should be able to support any 
metadata schema without requiring metadata translation or restrictions on the 
functionality offered. There are also cases where the same application needs to 
deal with different metadata at the same time. These different metadata might be 
needed because the documents have redundant descriptions in terms of different 
metadata, or because the DL application is dealing with a document collection 
described with heterogeneous metadata. The last case might occur, for instance, 
in the case of integration/merging of archives managed by different organizations.

Point 3) makes it possible that the metadata schema seen by the DL appli- 
cation is different from the metadata schemas actually stored in the repository 
of the content management system. Suppose that an application was built to 
deal just with a specific metadata schema. The MCMS should be able to serve 
requests of such an application even if metadata stored in the repository comply
with different schemas. Metadata schema independence can be obtained by
exploiting techniques of schema mapping. This feature is especially useful in 
case of heterogeneous metadata available at the same time in the repository: the 
DL application will refer to just one metadata schema, relying on the multiple 
schema mapping performed on the fly by the MCMS. In addition, this feature
allows different DL applications, which require different metadata schemas, to
share the same MCMS transparently. 

3 The MILOS Multimedia Content Management System 

We have designed and built MILOS (Multimedia digital Library Object Server), 
an MCMS that satisfies the requirements and offers the functionalities discussed
in the previous section. The MILOS MCMS has been developed by using the Web
Service technology, which in many cases (e.g. .NET, EJB, CORBA, etc.) already 
provides very complex support for “standard” operations such as authentication, 
authorization management, encryption, replication, distribution, load balancing, 
etc. Thus, we do not further elaborate on these topics, but we will mainly con- 
centrate on the aspects discussed above. 

MILOS is composed of three main components as depicted in Figure 1: the 
Metadata Storage and Retrieval (MSR) component, the Multi Media Server 
(MMS) component, and the Repository Metadata Integrator (RMI) component. 
All these components are implemented as Web Services and interact by using 
SOAP. The MSR manages the metadata of the DL. It relies on our technology for 
native XML databases, and offers the functionality illustrated at point 2) above. 
The MMS manages the multimedia documents used by the DL applications. 
MMS offers the functionality of point 1) above. The RMI implements the service 
logic of the repository providing developers of DL applications with a uniform 
and integrated way of accessing MMS and MSR. In addition, it supports the
mapping of different metadata schemas as described at point 3) above. All these 
components were built choosing solutions able to guarantee the requirements of 
flexibility, scalability, and efficiency, as discussed in the next sections. 




Fig. 1. General Architecture of MILOS.

3.1 Metadata Storage and Retrieval

A typical search in a DL is performed on metadata which describe the document 
content and their bibliographic information. Three different approaches have
been adopted until now to support document retrieval in digital libraries: (a) use 
of relational databases; (b) use of information retrieval engines; (c) full sequential 
scan of metadata records. Unfortunately, these approaches did not prove to be 
effective for DL applications: designers had to face the problem of choosing the 
right compromise between efficiency of the search systems and complexity of the 
metadata schema. The result of this compromise is that in many cases DLs
use very simple and flat metadata schemas such as Dublin Core [2]. 

Solution (a) requires that metadata should be converted into relational schemas.
This is easy for simple flat metadata schemas, such as Dublin Core, but it is far
more difficult for complex and descriptive metadata schemas, such as ECHO [14],
MPEG-7 [5], IFLA-FRBR [10], P/META [13]. Moreover, a query on these meta- 
data must be translated into complex SQL queries at relational level, resulting 
in many expensive joins to implement tree structure traversals. Thus, the result- 
ing search performance is often unacceptable. However, even with flat metadata 
schemas, pure relational databases do not offer all functionalities needed for an 
effective retrieval, such as full text search. 

Solution (b) uses full text search engines [22] to index metadata records. In 
this case the main emphasis is devoted to the textual information contained in 
metadata fields. Many text search engines offer the fielded indexing capability, 
where text contained in different fields is independently indexed. However, appli- 
cations are limited to relatively simple and flat metadata schemas. In addition, 
it is not possible to search by specifying ranges of values. 

Solution (c) is trivial but inefficient. It is not practicable in applications
that aim to be more than toy systems. In this case no indexing is
performed on the metadata and the custom search algorithm always scans the
entire metadata set to retrieve the requested information.

We successfully attempted a different approach: we have designed and im- 
plemented an enhanced native XML database/repository system with special 
features for DL applications. This is especially justified by the well known and 
accepted advantages of representing metadata as XML documents. Metadata 
represented with XML might have arbitrarily complex structures, which makes it possible
to deal with complex metadata schemas, and might be easily exported and imported.
Our XML database can store and retrieve any valid XML document. No
metadata schema or XML schema definition is needed before inserting an XML 
document, except optional index definitions for performance boosting. Once an 
arbitrary XML document has been inserted in the database it can be immedi- 
ately retrieved using XQuery. This allows DL applications to use arbitrary (XML 
encoded) metadata schemas and to deal with heterogeneous metadata, without 
any constraint on schema design and/or overhead due to metadata translation. 
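
The idea of schema-free insertion followed by immediate path-based retrieval can be illustrated with a minimal sketch. It is not the MILOS database: it is a toy in-memory store written with Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath and no XQuery. The class name `TinyXmlStore`, its methods, and the `urn:example:` identifiers are assumptions introduced purely for illustration.

```python
import xml.etree.ElementTree as ET


class TinyXmlStore:
    """Toy in-memory stand-in for a schema-free XML repository (not MILOS itself)."""

    def __init__(self):
        self._docs = {}  # URN -> parsed XML root element

    def insert(self, urn, xml_text):
        # Any well-formed document is accepted; no schema is declared beforehand.
        self._docs[urn] = ET.fromstring(xml_text)

    def find(self, path, value):
        """Return the URNs of documents in which an element at `path`
        (relative to the document root) has exactly the text `value`."""
        hits = []
        for urn, root in self._docs.items():
            if any((el.text or "").strip() == value for el in root.findall(path)):
                hits.append(urn)
        return sorted(hits)


store = TinyXmlStore()
# Two records describing the same title with different (hypothetical) schemas.
store.insert("urn:example:1",
             "<dc><title>Digital Libraries</title><creator>Rossi</creator></dc>")
store.insert("urn:example:2",
             "<lom><general><title><langstring>Digital Libraries</langstring>"
             "</title></general></lom>")
```

Both records coexist in the same store, and each is reachable through the path that suits its own structure, for example `store.find("title", "Digital Libraries")` for the flat record and `store.find("general/title/langstring", "Digital Libraries")` for the nested one.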

We decided not to use a commercial XML database system (e.g. Tamino [9]) 
because of our specific operational requirements: 

1. Particular attention must be given to the performance of search and insert
operations. 

2. It is not necessary to enforce a database-like transactional mechanism, since 
update operations are quite rare compared to search operations. Editing of
complex multimedia objects and their metadata can be based on a sort of
check-out/check-in mechanism.

Thus, our native XML database/repository system is simpler than a general
purpose XML database system, but offers significant improvements in specific
areas: it supports standard XML query languages such as XPath [18] and XQuery
[19], and offers advanced search and indexing functionality on arbitrary XML 
documents; it supports high performance search and retrieval on heavily struc- 
tured XML documents, relying on specific index structures [15, 23], as well as full 
text search [22], automatic classification [20], and feature similarity search [17]. 
The system administrator can associate an index to a specific XML element. For 
instance, the tag <abstract> can be associated with a full text index and with an
automatic topic classifier that automatically indexes it with topics chosen from 
a controlled vocabulary. On the other hand, the MPEG-7 <VisualDescriptor> 
tag can be associated with a similarity search index structure and with an au- 
tomatic visual content classifier. The XQuery language has been extended with 
new operators that deal with approximate match and ranking, in order to support
this new search functionality.
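
The association between XML elements and index types can be sketched as follows. This is a simplified illustration under stated assumptions, not the MILOS index layer: the configuration format `INDEX_CONFIG`, the class `ElementIndexes`, and the two index kinds shown (a word-level full text index and an exact value index) are hypothetical, and the classifier and similarity indexes described above are omitted.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical administrator configuration: element tag -> kinds of index to maintain.
INDEX_CONFIG = {
    "abstract": ["fulltext"],
    "year": ["value"],
}


class ElementIndexes:
    """Maintains, per configured element tag, a full text and/or exact value index."""

    def __init__(self, config):
        self.config = config
        self.fulltext = defaultdict(set)  # (tag, word) -> set of URNs
        self.value = defaultdict(set)     # (tag, exact text) -> set of URNs

    def add(self, urn, xml_text):
        root = ET.fromstring(xml_text)
        for el in root.iter():
            text = (el.text or "").strip()
            # Only elements named in the configuration are indexed.
            for kind in self.config.get(el.tag, []):
                if kind == "fulltext":
                    for word in re.findall(r"\w+", text.lower()):
                        self.fulltext[(el.tag, word)].add(urn)
                elif kind == "value":
                    self.value[(el.tag, text)].add(urn)

    def search_text(self, tag, word):
        return sorted(self.fulltext.get((tag, word.lower()), set()))

    def search_value(self, tag, value):
        return sorted(self.value.get((tag, value), set()))


idx = ElementIndexes(INDEX_CONFIG)
idx.add("urn:example:a",
        "<paper><abstract>Native XML databases</abstract><year>2004</year></paper>")
idx.add("urn:example:b",
        "<paper><abstract>Video streaming</abstract><year>2004</year></paper>")
```

Because the configuration is keyed by element name rather than by schema, documents with unrelated structures can share the same index as long as they use the configured tag.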

In our database every XML document is identified by a URN. Therefore,
relationships and links among documents - even if they are stored in different 
repositories - can be easily and unambiguously represented. 

3.2 Multi-media Server 

Different DL applications may have different storage and access needs. For exam- 
ple, very small DLs might store documents on standard hard disks, while more 
mission critical applications might need to store documents on arrays of disks, 
possibly duplicating and distributing content on several sites. Digital libraries 
dealing with huge archives of video documents might need to store them on
digital tapes maintained in silos, and to have arrays of disks used as temporary 
storage for frequently used documents. In addition, we must consider that a DL 
may scale over time, when the number of documents grows over a certain limit 
or faster access is needed. 

DL applications might also use different delivery strategies. For example, 
a small DL might serve documents using a normal web server, while heavily 
accessed DLs might need to use replication and load balancing strategies to 
guarantee high performance access to content. A video DL might use high per- 
formance video servers to stream videos in real time to users [21]. 

The MMS allows the programmers of the DL applications to be unaware of all 
these issues. The key idea is that the DL application should deal with documents 
in a uniform way, independently of the specific strategy used to manage them. 
Thus, the MMS identifies all documents with a URN and maintains a mapping
table to associate URNs with actual storage locations. Applications use the URN
to get or store documents from the MMS, which behaves as a gateway to the 
actual repository that stores the document. The system administrator can define 
rules that make use of MIME types, to specify how the MMS has to store a 
document of a specific type. For example, the rule may specify that an MPEG-2 
video has to be stored on a tape in a silo, while an image will be stored in
an array of disks. 
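
A minimal sketch of MIME-type storage rules combined with a URN mapping table is given below. It only illustrates the gateway idea: the rule table `STORAGE_RULES`, the back end names, the class `MediaServerSketch`, and the use of `urn:uuid:` identifiers are all assumptions, since the paper does not specify the rule syntax or the URN scheme used by MILOS.

```python
import uuid

# Hypothetical administrator rules: MIME type -> name of the storage back end.
STORAGE_RULES = {
    "video/mpeg": "tape-silo",
    "image/jpeg": "disk-array",
}
DEFAULT_BACKEND = "file-system"


class MediaServerSketch:
    """Assigns a URN to each stored document and remembers where it was placed."""

    def __init__(self, rules, default_backend):
        self.rules = rules
        self.default_backend = default_backend
        self.locations = {}  # URN -> (back end name, key within that back end)

    def store(self, mime_type, data):
        backend = self.rules.get(mime_type, self.default_backend)
        urn = "urn:uuid:" + str(uuid.uuid4())
        # A real system would write `data` to the back end here;
        # this sketch records only the URN-to-location mapping.
        self.locations[urn] = (backend, len(self.locations))
        return urn

    def locate(self, urn):
        return self.locations[urn]


mms = MediaServerSketch(STORAGE_RULES, DEFAULT_BACKEND)
video_urn = mms.store("video/mpeg", b"...")
text_urn = mms.store("text/plain", b"hello")
```

The application keeps only the returned URN; changing `STORAGE_RULES` affects where new documents are placed without any change to application code.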

Special care is taken to deal with the actual access protocols offered to
retrieve the documents. An application will always refer to a specific document
by its URN. However, the retrieval of the document should be done using
an access protocol compatible with the storage and delivery strategy associated 
with the document. For instance, when the document is stored in a web server 
it will be retrieved with an HTTP request. On the other hand, suppose that a 
video document is served through a commercial video server such as the Helix 
Universal Server [4]; in this case the real time streaming of the video will be 
obtained using RTSP [6]. When an application needs to retrieve a document,
the MMS will translate the given URN into a specific handle (for instance an 
RTSP URL) that the application will use to access the document. 
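
The translation of a URN into a protocol-specific handle can be sketched as a simple table lookup. The table contents, host names, paths, and the function name `resolve_handle` are made up for illustration; the only facts taken from the text are that web-served documents are fetched over HTTP and streamed videos over RTSP.

```python
# Hypothetical table: URN -> (kind of delivery back end, host, path).
DELIVERY = {
    "urn:example:doc:42": ("web", "docs.example.org", "/papers/42.pdf"),
    "urn:example:video:7": ("stream", "video.example.org", "/archive/7.rm"),
}

# Access protocol (URL scheme) used by each kind of delivery back end.
SCHEMES = {"web": "http", "stream": "rtsp"}


def resolve_handle(urn):
    """Translate a URN into the URL the application should use to fetch the document."""
    backend, host, path = DELIVERY[urn]
    return f"{SCHEMES[backend]}://{host}{path}"
```

An application would call `resolve_handle` with the URN it already holds and then open the returned URL with a client for the corresponding protocol.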

3.3 Repository Metadata Integrator 

The RMI manages the accesses to the document and metadata repositories and 
supports metadata mapping to guarantee metadata independence. The mapping 
of application requests into requests compatible with the metadata schema actually
managed by the MCMS is accomplished by defining a set of schema mapping
rules. The main purpose of this mapping is to translate application requests into
XQuery queries compliant with the stored metadata. This mechanism allows the
RMI to translate names of fields (such as Title, Author, etc.) known to the DL 
application, into requests to the MSR without the need of knowing the specific 
schema model adopted. When a new XML schema is introduced, the system 
administrator must specify the mappings for the new metadata. 

Each mapping rule specifies how to translate the name of a metadata field, 
known to the application, into an XPath expression that specifies the XML path 
names that should be used to access that metadata field in the target metadata 
schema. A generic mapping rule has the following structure:

metadataType[.Name]* = <RE_XPath>,<SE_XPath>

where

1. The metadataType field identifies the metadata model used by the application,
e.g. DublinCore, SCORM, MPEG-7, etc.;

2. Name is the name of a metadata field requested by the application, e.g.
Title, Author, etc. If empty, it means that the rule applies to all metadata
fields of the specified metadataType;

3. <RE_XPath> (Retrieved Element XPath) is the XPath corresponding to
the XML element that will be retrieved with this field;

4. <SE_XPath> (Searched Element XPath) is the XPath, under <RE_XPath>,
of the element that contains the value of the metadata field used for searching.

As an example, let us consider a DL for e-Learning applications, where the
Learning Objects in the repository have a complex metadata structure, based on
SCORM [11]. Suppose that we want to search SCORM metadata through Dublin
Core. We can use the following mapping rules:

dc.title = /lom, general/title/langstring

dc.description = /lom, general/description/langstring

They specify that the Dublin Core metadata fields ‘dc.title’ and ‘dc.description’
can be searched in SCORM by means of the XPath strings lom/general/title/langstring
and lom/general/description/langstring, respectively.
The whole <lom> element will be retrieved when <langstring> contains the desired
value. Note that the <title> and <description> SCORM XML elements
do not contain the title text of the document, but the element <langstring>,
which in turn contains the real text.
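
To make the rule format concrete, the following sketch parses rules of the form shown above into a lookup table. It is an illustration only: the function names `parse_mapping_rule` and `load_rules` are hypothetical, and the parser assumes a single Name component and that neither XPath contains a comma or an equals sign before the separator, which a production parser could not assume.

```python
def parse_mapping_rule(line):
    """Split 'metadataType.Name = RE_XPath, SE_XPath' into its four parts."""
    left, right = line.split("=", 1)
    # Name may be empty, meaning the rule applies to all fields of the type.
    metadata_type, _, name = left.strip().partition(".")
    re_xpath, se_xpath = (part.strip() for part in right.split(",", 1))
    return metadata_type.strip(), name.strip(), re_xpath, se_xpath


def load_rules(lines):
    """Build a lookup table keyed by (metadataType, Name)."""
    table = {}
    for line in lines:
        mtype, name, re_xpath, se_xpath = parse_mapping_rule(line)
        table[(mtype, name)] = (re_xpath, se_xpath)
    return table


RULES = load_rules([
    "dc.title = /lom, general/title/langstring",
    "dc.description = /lom, general/description/langstring",
])
```

With this table, looking up `("dc", "title")` yields the retrieved element `/lom` and the searched element `general/title/langstring`.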

Let us now explain how the mapping directives are used by RMI to generate 
the XQuery query. The RMI allows applications to search on metadata by using 
the findExactMatch method: 

findExactMatch( string MetadaType, vector of string fields, vector of 
string values, string returnFields), 

This method searches for a set of metadata records of the specified MetadaType. 
The fields parameter is a vector of (application known) names of metadata fields, 
of the MetadaType, to search for. The values parameter specifies the values that 
the fields must match (the different fields are searched by using the boolean con- 
nective AND). Finally, returnFields specifies the fields of the retrieved records 
(i.e. RE_XPath) that the application wants to know. The method translates 
the request into an XQuery query as follows: 

1. for each triple <MetadaType, value_i, field_i> specified in the findExactMatch,
RMI searches the mapping rules matching MetadaType.field_i to
fetch the corresponding XPath strings RE_XPath_i and SE_XPath_i;

2. for each pair <MetadaType, returnField_j> specified in the findExactMatch,
RMI searches the mapping rules matching MetadaType.returnField_j
to fetch the corresponding XPath strings RE_XPath_ret_j and
SE_XPath_ret_j;

3. check that all the strings RE_XPath_i and RE_XPath_ret_j are the same string
and call that string RE_XPath, otherwise fail and stop;

4. finally, combine the XPath strings RE_XPath, SE_XPath_i, and
SE_XPath_ret_j to generate the XQuery query, as follows:

for $a in RE_XPath
where $a/SE_XPath_1 = value_1 and ... and $a/SE_XPath_n = value_n
return $a/SE_XPath_ret_1 ... $a/SE_XPath_ret_m

Example: Suppose that an application wants to use Dublin Core to search 
SCORM metadata having a specific title, and wants to have back the corresponding
descriptions. In this case we have MetadataType = dc, field_1 = title,
returnField_1 = description. Applying the previous mapping rules we obtain:

for $a in /lom
where $a/general/title/langstring = value_1
return $a/general/description/langstring
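
The four translation steps can be sketched in a few lines. This is a minimal illustration and not the RMI implementation: the function name `build_exact_match_query` and the in-memory `RULES` table are assumptions, the returned paths are wrapped in a parenthesized, comma-separated sequence (a choice made here so that several return fields form valid XQuery), and values are embedded as quoted string literals with only basic escaping.

```python
# Mapping rules from the example, keyed by (metadata type, field name);
# each value is the pair (RE_XPath, SE_XPath).
RULES = {
    ("dc", "title"): ("/lom", "general/title/langstring"),
    ("dc", "description"): ("/lom", "general/description/langstring"),
}


def build_exact_match_query(rules, metadata_type, fields, values, return_fields):
    """Generate an XQuery string following the four translation steps."""
    if not fields or not return_fields:
        raise ValueError("at least one search field and one return field are required")
    if len(fields) != len(values):
        raise ValueError("fields and values must have the same length")
    # Steps 1 and 2: look up the XPath pair for each searched and returned field.
    searched = [rules[(metadata_type, f)] for f in fields]
    returned = [rules[(metadata_type, f)] for f in return_fields]
    # Step 3: every rule must agree on the retrieved element, otherwise fail.
    roots = {re_xpath for re_xpath, _ in searched + returned}
    if len(roots) != 1:
        raise ValueError("mapping rules disagree on RE_XPath")
    re_xpath = roots.pop()
    # Step 4: assemble the for/where/return clauses.
    conditions = " and ".join(
        '$a/{} = "{}"'.format(se, value.replace("&", "&amp;").replace('"', '""'))
        for (_, se), value in zip(searched, values)
    )
    projections = ", ".join("$a/{}".format(se) for _, se in returned)
    return "for $a in {}\nwhere {}\nreturn ({})".format(re_xpath, conditions, projections)


query = build_exact_match_query(
    RULES, "dc", ["title"], ["Digital Libraries"], ["description"])
```

For the Dublin Core example, `query` holds a for/where/return expression over `/lom` equivalent to the one shown above, with the literal "Digital Libraries" in place of value_1.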

4 Field Trials 

In order to verify and demonstrate the flexibility and efficiency of MILOS in 
managing different heterogeneous DL applications, we took four data sets used 
by four different existing DLs and we built the corresponding DL applications 
on top of MILOS. The data sets that we considered consist of documents and 
metadata of very different nature: the Reuters data set [7], the ACM Sigmod 
Record dataset [8], the DBLP data set [1], and the ECHO data set [14]. 

The DL applications that we built use the same MILOS installation and all 
data sets were stored together. The functionality of MILOS allows individual 
applications to selectively access data and metadata of their interest or to per- 
form cross-library search. Each DL application consists of a specific search and 
browsing interface (built according to the data managed) and a bulk import 
tool. The search and browse interfaces were built as web applications using Java 
Server Pages (JSP). The bulk import tool was a simple Java application. On av- 
erage, the effort required to build each application from scratch was one week of 
work of a single skilled person. This, we believe, is very little effort compared
to the cost that would have been required to build from scratch a DL, without 
general purpose tools, or the cost that would have been required to translate and 
adapt the data and metadata to cope with the requirements and restrictions of 
an existing DL system. 

We built the browse and retrieval interface from scratch. However, we are 
currently working to develop a tool supporting the automatic generation of the 
browsing and retrieval interface according to data and metadata fields. This will 
contribute to a further reduction of the cost of building DL applications. 

All applications proved to be very efficient. We installed the system, the
applications, and the data on a single computer equipped with a 1.8 GHz
Pentium and 1 GB of RAM, running Windows 2000 Server. We have used JAX-RPC
as SOAP application server to run MILOS. Applications have been tested by 
30 users operating at the same time from remote workstations, and executing 
a predefined search intensive job. On average the response time of the system 
was below 1 second. Notice also that for more intensive uses of the system, the 
underlying Web Service technologies offer plenty of solutions to guarantee scalability,
exploiting techniques of replication, load balancing, resource/connection
pooling etc. 

The Reuters data set [7] contains the text of news stories and the corresponding
metadata, composed of Reuters specific metadata including titles, authors, topic
categories, and extended Dublin Core metadata. The data set contains 810,000
news stories (2.6 Gb) with text and metadata both encoded in XML. We
associated the full text index and the automatic topic classifier to the elements 
containing the body, the title, and the headline of the news. Other value indexes 
were associated with elements corresponding to frequently searched metadata, 
such as location, date, country. The search interface allows the user to perform
integrated text, category, and exact match search. 

Both the ACM Sigmod Record [8] and the DBLP data-sets [1] consist 
of metadata corresponding to the description of scientific publications in the 
computer science domain. The ACM Sigmod record is composed of 46 XML files 
(1Mb), while the DBLP data-set is composed of a single large (187Mb) XML file.
Their structure is completely different even if they contain information describing 
similar objects. For these two datasets we built one single DL application that 
allows one to access both. The MILOS mapping is used to translate application 
requests into the two schemas. We associated a full text index with the elements
containing the titles of the articles, and we associated other value indexes to 
other frequently searched elements, such as the authors, the dates, the years, 
etc. The search and browse interface allows users to search for articles by various
combinations of full text and exact/partial match of elements. In addition, it allows
users to browse results by navigating through links (and implicitly submitting new
queries to MILOS) related to the author, journal, conference, etc.
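
The role of the mapping can be pictured with a minimal Java sketch. The paper does not show the MILOS mapping format, so the class `SchemaMapping`, its methods, and the schema names used in the usage note are assumptions made purely for illustration: one application-level field is mapped to a different element path in each schema, and a single request is rewritten once per schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical mapping table; the real MILOS mapping format is not shown in the text.
final class SchemaMapping {
    // application field -> (schema name -> element path in that schema)
    private final Map<String, Map<String, String>> table = new LinkedHashMap<>();

    void map(String appField, String schema, String elementPath) {
        table.computeIfAbsent(appField, k -> new LinkedHashMap<>())
             .put(schema, elementPath);
    }

    // Rewrites one application-level condition into one condition per schema.
    Map<String, String> translate(String appField, String value) {
        Map<String, String> paths = table.get(appField);
        if (paths == null) {
            throw new IllegalArgumentException("unmapped field: " + appField);
        }
        Map<String, String> perSchema = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : paths.entrySet()) {
            perSchema.put(e.getKey(), e.getValue() + " = \"" + value + "\"");
        }
        return perSchema;
    }
}
```

For instance, after registering an illustrative path such as `/dblp/article/title` for the field `title` under a schema called `dblp`, and a second path under `sigmod`, a call to `translate("title", "XML")` yields one condition for each data set, so the application never needs to know which schema a record comes from.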

The ECHO data set [14] includes historical audio/visual documents and the 
corresponding metadata. ECHO is a significant example of the capability of MI- 
LOS to support the management of arbitrary metadata schemas. The metadata 
model adopted in ECHO, based on the IFLA/FRBR [10] model, is rather complex
and strongly structured. It is used for representing the audio-visual content of 
the archive and includes among others, the description of videos in English and 
in the original language, specific metadata fields such as Title, Producer, year, 
etc., the boundaries of detected scenes (associated with textual descriptions),
the audio segmentation (distinguishing among noise, music, speech, etc.), the 
Speech Transcripts, and visual features for supporting similarity search on key- 
frames. The collection is composed of about 8,000 documents for 50 hours of 
video described by 43,000 XML files (36 Mb). Each scene detected is associated 
with a JPEG encoded key frame for a total of 21GB of MPEG-1 and JPEG files. 
Full text indexes were associated with textual descriptive fields, similarity search
indexes were associated with elements containing MPEG-7 image (key frame)
features, and other value indexes were associated with frequently searched
elements. The search and retrieval interface (Figure 2) allows users to find videos
by combining full text, image similarity, and exact/partial match search. Users 
can browse among scenes and the corresponding metadata. The original ECHO DL
application [3] was built using a relational database, translating all
metadata into a relational schema. Even simple searches required several (up to 10 or
more) seconds to be processed. With MILOS we had a dramatic improvement 
of performance, being able to serve requests in less than one second even with 
several users accessing the system. 



5 Conclusion 

This paper described the architecture of the MILOS Content Management
System and the solutions adopted to obtain a system that is flexible in the
management of documents with different types of content and descriptions, and
that is efficient and scalable in the storage and content based retrieval of these
documents. In particular, we described the approach adopted to support the
management of different metadata descriptions of multimedia documents in the
same repository. This is a step towards solving the challenging problem of
interoperability among different metadata descriptions. The proposed solution,
based on a mapping mechanism among the metadata fields of the different
models, has been tested in practice by using the MILOS system to archive
documents belonging to four different and heterogeneous collections, which
contain news stories, scientific papers, and audio/video documentaries.
Archiving these documents was straightforward: it only required the creation of
the mapping file and the development of the user interfaces to archive and to
search the documents.

Fig. 2. The ECHO retrieval interface implemented in MILOS.

Further work is foreseen in several directions: on one side we
are working to improve the retrieval capabilities of the Metadata Storage and
Retrieval component; on the other side, we are working with partners of the ECD
[12] project on automating the mapping between different metadata
schemas, by using thesauri and cross-language vocabularies [16].


Abstract. Athens University recently initiated a digital collection development
project to provide enhanced educational capabilities. Collections vary in terms 
of the material included and the requirements imposed by potential users. In or- 
der to simplify collection management and promote collection interoperability, 
a common Digital Library platform should be employed to support all collec- 
tions. In order to deal with the extended requirements imposed, Athens Univer- 
sity has decided to support an integrated Digital Library framework for multiple 
heterogeneous digital collections development. The most important require- 
ments imposed by the University’s collections are discussed in this paper, along 
with the characteristics of the Folklore Collection, one of the most complex and 
diverse ones. In order to evaluate available Digital Library systems, the Folk- 
lore Collection has been chosen as a guide for the development of two proto- 
type implementations using Fedora and DSpace. Conclusions drawn from their 
comparison and the proposed integrated Digital Library architecture based on 
Fedora are also presented. 

Keywords: Digital library architecture, managing heterogeneous collections, 
Fedora, DSpace. 



1 Introduction 

Athens University initiated a digital collection development project to gather research 
material produced by its members in order to provide enhanced educational capabili- 
ties. The material belongs to collections developed by cataloguers and researchers 
working in University libraries, administered by the Libraries Computer Centre 
(LCC). Each collection has specific scientific or cultural significance, includes differ- 
ent types of material (e.g. text and manuscripts, music, photographs, videos), consists 
of either born-digital or digitized material and satisfies diverse user requirements in 
terms of object structure, metadata and presentation. For instance, most cultural col- 
lections are archival in nature and mainly contain digitized material, while most scien- 
tific collections are constantly updated and contain both digitized and born-digital 
material. 

In order to simplify collection management and promote collection interoperabil- 
ity, it was decided to employ a common Digital Library (DL) platform to support all 
aforementioned collections. Among these collections, the Folklore Collection has 
been selected as a guide for the design of the DL system, representing one of the most 
complex collections of Athens University. Moreover its main characteristics and re- 
quirements are rather common in the majority of the other collections. Folklore Col- 
lection is described in detail in Section 2. 

Libraries Computer Centre policy favours extending a fully customizable
DL system over using “out of the box” software. The basic prerequisites for this
software are that it be open-source and provide interoperability features, adequate
preservation capabilities and support for both born-digital and digitized content.
The detailed requirements, as imposed by LCC, are presented in Section 3. Based on
these criteria, two systems were selected, namely Fedora [7] and DSpace [4], and their
final evaluation, as presented in Section 4, has been based on a prototype version of
the Folklore Collection developed on both systems. The results of this attempt led to
the integrated DL architecture designed to support the Folklore Collection, as
described in Section 5. Finally, a summary is provided in Section 6.



2 Folklore Collection Characteristics 

The Folklore Collection is dedicated to the local traditions and customs of several regions
in Greece, representing the way of living and thinking in these regions over the last
two centuries. The Folklore Collection consists of hand-written travelling notebooks
produced by the students of the Greek Literature Department. The notebooks
also contain notes and maps created by the author, and lyrics or handicrafts
related to a specific region. The lyrics and handicrafts included in a notebook must
be treated both as parts of it and as independent objects belonging to a different
sub-collection. Specifically, the Folklore Collection is divided into:

a) Notebook sub-collection. Each notebook is a manuscript written by a student after 
local research and refers to a specific area or village. The notebook is separated 
into predefined chapters and sections and includes a table of contents. Most of the 
notebooks are accompanied by drawn maps, photographs of inhabitants and regions,
artifacts (e.g. laces or doles) and sound recordings with songs and folk music. 

b) Photographs sub-collection, consisting of the photographs that are inside the
notebooks as accompanying material,

c) Artifact sub-collection, exhibited in the library, and

d) Sound recordings sub-collection, consisting of local music, lyrics and tale re- 
cordings related to the notebooks. 

In order to support the Folklore Collection, the following requirements should be 
satisfied: 

Sub-collection Support 

Due to the variety of material and the complex relations between Folklore Collection
resources, the collection must be organized into sub-collections by grouping related
resources according to specific criteria like the ones that Johnston and Robinson [10]
have indicated: i) the topic coverage, ii) the specific usage or purpose that each
resource has in the context of the collection, iii) the provenance, iv) the type of material,
v) the specific spatial or temporal coverage and vi) the same category of object. The
heterogeneity and the large number of resources make it necessary to separate the
material into collections and sub-collections, since there is no other way to represent
complex structures and to attach rich semantics at every level. By defining
sub-collections, the attributes inherited by sub-collections from the collection are identified
and the overall collection can be easily navigated by users using various access points
(date, subject, geographic area).

Collection-Level Description and Definition

High-level collection description is important in order to help the navigation, discov- 
ery and selection of cultural content [3]. The collection-level description simplifies 
the retrieval of information because the user can decide whether the collection is of
interest without getting into details about the objects, and also contributes to better
administration of large collections. Thus, it is required to offer a detailed collection 
and sub-collection level description with the appropriate metadata elements, after 
specifying the structure of the collection. Specifically, the Folklore Collection is de- 
scribed by the Dublin Core Collection Description Application Profile [6], while the 
schema is extended by local elements related to custom properties. 

Representation of Composite Objects 

Every notebook contains hand-written text separated into chapters and sections along 
with a table of contents. The notebook structure is based on a predefined list of
headings that correspond to specific aspects of everyday life, such as dress code, eating
and religious customs. Furthermore, each notebook is accompanied by artifacts,
photographs and maps. In its written form, it is difficult to search for information inside
the notebook. Notebooks should be represented as composite objects consisting of 
disparate parts (chapters and sections), which should be characterized individually by 
descriptive metadata. 

Description of Existing Relations 

It is necessary to represent all kinds of relations that exist inside and outside the
collection across structural levels, in order to provide users with all the
information that is hidden in the collection. For example, the relation between the photograph
referenced on a notebook page and the actual photograph belonging to the photographs
sub-collection (probably in another format) should be identified (for example: “has
format” or “is converted to” or “is the same as”).

Appropriate Metadata Support 

Due to the variety of Folklore Collection material, various metadata schemes, such as
Dublin Core (DC) [5] and Learning Object Metadata (LOM) [8], and local fields should
be supported. This is essential in order to keep all the valuable information
for preservation, authenticity and retrieval of information. It is important for the users
to have many access points to the content of the Folklore Collection and to be able to
search by date, subject, geographic domains and the type of objects. DC is adopted as
the basic metadata scheme, and it is further extended to cover other aspects such as i) the
technique of digitization and the technical requirements, ii) meta-metadata information,
because of the heterogeneous material, and iii) the educational character and the
purpose of every resource.

Selecting Folklore Collection as a Guide 

The Folklore Collection has been selected as a guide for the development of the DL
platform due to its complex and diverse nature, since it consists of interrelated
manuscripts, photographs, maps, artifacts and audio. This material is not born-digital, so its
digitization is required. The same holds for the majority of Athens University
collections. Moreover, this digitized content imposes certain limitations on user
interactions (i.e. no free text search on the actual content is available), so alternative
facilities should be provided to enrich the user’s capabilities.
These enhancements will target the accurate representation of the physical
relations and associations of the digitized content, along with its structure (chapters,
sections, table of contents). Such entities and relations, if characterized with appropriate
metadata, will eventually allow the user to acquire an enhanced view of the whole
collection. Once again, such exploration of possibilities is better performed on the
Folklore Collection, due to its complexity.



3 Criteria for Selecting a Digital Library System 

This section discusses the DL System requirements imposed by LCC, covering stor- 
age / collection issues and design / implementation issues of the DL System. 

Storage Capabilities / Collection Issues 

• Basic Preservation capabilities: The DL System should handle effectively pres- 
ervation issues, by assigning persistent unique identifiers to digital objects and 
providing support for various file formats and versions for the storage of their 
content. It is also necessary to support the usage of technical metadata in order to 
describe the format of the files or the digitization process. 

• Multilingual support: The system should support at least the Greek and English
languages, with regard to both content and metadata storage and presentation. 

• Effective Support for Digitized Content: As illustrated by the description of the
Folklore Collection, the system should provide the ability to handle digitized content
effectively. Moreover, the majority of the Libraries’ collections considered for
digitization in Athens University share the Folklore Collection’s dependence on
digitized content support (Byzantine music manuscripts, ancient papyri, etc).

• Support for multiple, heterogeneous collections: The point described above
illustrates the need for the efficient handling of multiple, heterogeneous collections
by a common centralized DL System.

Design / Implementation Issues 

• Interoperability support: Interoperability between Athens University DL and 
other Academic DLs in Greece or worldwide is an important issue. The standard 
that should be supported to achieve interoperability of metadata is Open Archives 
Initiative Protocol for Metadata Harvesting (OAI-PMH) [11]. Furthermore, the
system must support open standard file types, for the interoperability of the digi- 
tal content. 

• Flexibility and Expandability: The system should be flexible and expandable, 
allowing the addition of extra functionality in a straightforward manner. This is- 
sue suggests that the selected DL System should impose minimum restrictions 
regarding its usage patterns and scenarios. 

• Separation of content storage and representation / interfaces: The system should
separate the representation logic from its core storage functionality to the highest
possible degree. The DL service must be incorporated into an integrated web
environment, the LCC portal, so it is of great importance to be able
to “program” the interface in arbitrary ways while the DL system handles the
storage, preservation and content retrieval issues independently.

• Implemented in Java: The development of the LCC portal has already started,
using Java to integrate backend databases, the OPAC and collaboration software
used by Library staff for everyday activities. Usage of Java as the basic
development language is important in order to retain the integrated computing
environment of the LCC.

The above requirements highlight the main DL System selection criteria. No existing 
DL system could be used “out of the box” to implement Folklore Collection, or other 
collections of the same diverse and complex nature. In order to develop an effective, 
usable and integrated DL, the University is focusing on making a long-term investment
in the selected DL System. From this perspective, another important requirement is
added, which refers to the ability to freely extend and customize the selected system.
In other words, the selected DL System should be distributed under an Open Source
License [14]. 
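
To make the OAI-PMH interoperability requirement listed above more concrete, the following Java sketch builds a harvesting request URL. The `verb`, `metadataPrefix` and `set` arguments and the `oai_dc` prefix come from the OAI-PMH protocol itself; the class name `OaiPmhRequest`, the base URL and the set name in the usage note are hypothetical and not taken from the paper.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative helper, not part of any DL system discussed in the text.
final class OaiPmhRequest {

    // Builds a ListRecords request; 'set' may be null to harvest the whole repository.
    static String listRecords(String baseUrl, String metadataPrefix, String set) {
        StringBuilder url = new StringBuilder(baseUrl)
                .append("?verb=ListRecords")
                .append("&metadataPrefix=").append(encode(metadataPrefix));
        if (set != null) {
            url.append("&set=").append(encode(set));
        }
        return url.toString();
    }

    // Percent-encodes a query argument value.
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }
}
```

A harvester at another institution could call `OaiPmhRequest.listRecords("http://dl.example.org/oai", "oai_dc", null)` to request all records as unqualified Dublin Core, or pass a set name to restrict the harvest to a single collection.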



4 Digital Library System Comparison 

A list of open-source institutional repository software supporting OAI-PMH is
proposed in [13]. Based on the criteria presented in the previous section, two of them
were chosen for further evaluation: Fedora and DSpace. Of the seven systems
included in this list, three are implemented using programming languages other
than Java, so they were not considered for the final evaluation. Of the remaining
systems, two have no defined preservation strategy, which is a basic prerequisite for
the Athens University DL System. We therefore proceeded to the further evaluation of Fedora
and DSpace, which are consistent with all basic requirements and already have a large
number of installations worldwide. Fedora and DSpace are based on open and modular
architectures. The former uses the Flexible and Extensible Digital Object and
Repository Architecture (FEDORA) [15], while the latter is based on a three-layered
architecture and a data model influenced by the Open Archival Information
System (OAIS) reference model [2]. The main modules of each system provide public
APIs to access and manage metadata and digital content. Both systems
address preservation issues by supporting multiple digital formats of the same content,
using technical metadata and retaining a globally unique identifier to access each digital
object. They support digitized objects better than other platforms that are oriented towards
born-digital material, mainly electronic documents. Neither system is restricted to
specific file formats or digital content types.

In order to evaluate both systems, a prototype version of the Folklore Collection
has been developed in each one. The goal was to explore each system’s capabilities to
support the specific collection, using its built-in functionality. A brief description of
each system, along with the remarks arising from this attempt, is reported
in the following sections.

4.1 Fedora 

Fedora is a Java-based open-source digital repository system [17] with a
flexible and extensible architecture. The basic entity of a Fedora repository is the digital
object. A digital object comprises a persistent identifier (PID), system metadata,
one or more datastreams and disseminators that associate datastreams with behaviors.
Datastreams are used to represent metadata or digital content. Digital objects are
stored internally as XML files based on an extension of the Metadata Encoding and
Transmission Standard (METS) schema [12]. There are three distinct types of digital
objects: data objects, behavior definition objects and behavior mechanism objects.
The first type represents entities that contain the content and metadata, while the other two
define and implement the methods that present or transform the content of digital
objects. Fedora provides two public APIs for management services (API-M, API-M-Lite)
and two for access services (API-A, API-A-Lite).
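
The object model described above can be pictured with a minimal Java sketch. The class and field names below (`DigitalObject`, `Datastream`, `Disseminator`) and the example identifiers are illustrative assumptions; they are not Fedora's own classes or API. The sketch only mirrors the structure named in the text: a PID, one or more datastreams, and disseminators that bind a datastream to a behavior definition and a behavior mechanism.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only; these are not Fedora's own classes.
final class Datastream {
    final String id;       // e.g. "DC" for a Dublin Core record
    final String mimeType; // type of the metadata or content held

    Datastream(String id, String mimeType) {
        this.id = id;
        this.mimeType = mimeType;
    }
}

final class Disseminator {
    final String bdefPid;      // behavior definition object
    final String bmechPid;     // behavior mechanism object
    final String datastreamId; // datastream the behavior is bound to

    Disseminator(String bdefPid, String bmechPid, String datastreamId) {
        this.bdefPid = bdefPid;
        this.bmechPid = bmechPid;
        this.datastreamId = datastreamId;
    }
}

final class DigitalObject {
    final String pid; // persistent identifier
    final List<Datastream> datastreams = new ArrayList<>();
    final List<Disseminator> disseminators = new ArrayList<>();

    DigitalObject(String pid) {
        this.pid = pid;
    }

    DigitalObject addDatastream(Datastream ds) {
        datastreams.add(ds);
        return this;
    }

    // A disseminator may only reference a datastream that exists on this object.
    DigitalObject addDisseminator(Disseminator d) {
        boolean known = false;
        for (Datastream ds : datastreams) {
            if (ds.id.equals(d.datastreamId)) {
                known = true;
            }
        }
        if (!known) {
            throw new IllegalArgumentException("unknown datastream: " + d.datastreamId);
        }
        disseminators.add(d);
        return this;
    }
}
```

The check in `addDisseminator` is a design choice of the sketch: it keeps every behavior binding anchored to content that is actually present on the object.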

Collections are not natively supported by Fedora. In order to describe collections, it 
is practical to use a data object to represent each collection containing the appropriate 
collection description and rights metadata and the templates for the creation of data 
objects. Sub-collections can be defined in the same way and the relation between the 
parent and child collection can be described by specific structural metadata in both 
data objects. Sub-collection objects can inherit description and rights metadata from 
the parent collection. Additionally, a collection management module should be im- 
plemented that will communicate with the management API and control the afore- 
mentioned functionality. 

A composite object, such as a Folklore notebook, can be represented as a number of
data objects. Some of the data objects represent logical entities of the physical object,
such as chapters or the whole notebook view. Others represent the physical entities, for
example the pages of the notebook. The relations between those data objects constitute
the structure of the composite object.

Relations are necessary to represent the structure of composite objects or to relate
independent data objects that belong to the same or different collections. To support
relations in Fedora, a special datastream can be used on each data object to
hold its structural metadata. A behavior object can be associated with the data
object to describe the methods that will represent the relations at the presentation level.
These methods must be implemented in a general manner in order to support each
relation’s special requirements. An extension module must also be developed over the
Fedora management API in order to manage the relations between data objects.
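
As a rough illustration of how such relations give a composite object its structure, the Java sketch below keeps (predicate, target) pairs per source object in memory. It is a stand-in for the per-object relations datastream, not an implementation on top of Fedora; the class names, the PIDs and the predicate names (`hasPart`, `hasPage`, `references`) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the per-object relations datastream.
final class RelationIndex {
    // source PID -> list of {predicate, target PID}
    private final Map<String, List<String[]>> bySource = new LinkedHashMap<>();

    void relate(String sourcePid, String predicate, String targetPid) {
        bySource.computeIfAbsent(sourcePid, k -> new ArrayList<>())
                .add(new String[] {predicate, targetPid});
    }

    // Returns the targets reachable from sourcePid through the given predicate.
    List<String> targets(String sourcePid, String predicate) {
        List<String> out = new ArrayList<>();
        for (String[] r : bySource.getOrDefault(sourcePid, new ArrayList<>())) {
            if (r[0].equals(predicate)) {
                out.add(r[1]);
            }
        }
        return out;
    }
}

final class NotebookExample {
    // A notebook made of two chapters; one page references an independent photograph.
    static RelationIndex build() {
        RelationIndex idx = new RelationIndex();
        idx.relate("folklore:notebook-1", "hasPart", "folklore:chapter-1");
        idx.relate("folklore:notebook-1", "hasPart", "folklore:chapter-2");
        idx.relate("folklore:chapter-1", "hasPage", "folklore:page-1");
        idx.relate("folklore:page-1", "references", "folklore:photo-7");
        return idx;
    }
}
```

Logical entities (chapters) and physical entities (pages) are both ordinary data objects here; only the relations distinguish them, which matches the modeling approach described in the text.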

Any metadata model can be described and accessed in one or more datastreams
of the digital object. The metadata model can be a local metadata set, a standard
metadata set or an extension of the Dublin Core metadata element set. The disadvantage
is that Fedora supports indexing and searching services only for the Dublin Core
metadata element set, so an external application should be used to index other metadata
fields.



4.2 DSpace 

DSpace is an open source digital repository system [16], implemented in Java and
primarily focused on institutional and research material (reports, research papers and 
publications). It provides a solution for the problem of collecting, storing, preserving, 
indexing and distributing such material in digital form. 

DSpace is based on a straightforward three-tier architecture. The storage
layer is responsible for the storage of items (digital objects) and their metadata (qualified
Dublin Core metadata scheme). Digital content is stored in the file system and
associated with items in terms of bitstreams and bundles, allowing an item to contain
various files. The business logic layer consists of numerous components handling
individual aspects of the DSpace system, such as browsing, searching, user/group
management and authorization, workflow management, content management and
administration. Finally, the application layer provides the end user interaction and interface
functionality, in terms of web user interface, batch item importing facilities, OAI
metadata providers and the like. Regarding preservation issues, DSpace provides
support for CNRI handles, assuring the assignment of global persistent identifiers to
its items. Moreover, by exploiting a simple and effective file format support
scheme it provides “bit preservation” of the digital content, while “functional
preservation” is provided only for the “supported” file formats. The DSpace information
model is based on Communities, consisting of users and groups, containing
Collections, which in turn contain items (the digital objects). Finally, a workflow model
adequate for most purposes is provided, along with a simple administration toolkit.
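
The containment hierarchy of the information model just described can be summarized in a short Java sketch. The classes below are hypothetical and written only to mirror the terms used in the text (Community, Collection, item, bundle, bitstream); they are not the DSpace API, and `DsCollection` is named that way simply to avoid confusion with `java.util.Collection`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical classes mirroring the containment hierarchy; not the DSpace API.
final class Bitstream {
    final String fileName;

    Bitstream(String fileName) {
        this.fileName = fileName;
    }
}

final class Bundle {
    final String name;
    final List<Bitstream> bitstreams = new ArrayList<>();

    Bundle(String name) {
        this.name = name;
    }
}

final class Item {
    final String handle; // global persistent identifier of the item
    final List<Bundle> bundles = new ArrayList<>();

    Item(String handle) {
        this.handle = handle;
    }

    // Number of files attached to this item across all bundles.
    int fileCount() {
        int n = 0;
        for (Bundle b : bundles) {
            n += b.bitstreams.size();
        }
        return n;
    }
}

final class DsCollection {
    final String name;
    final List<Item> items = new ArrayList<>();

    DsCollection(String name) {
        this.name = name;
    }
}

final class Community {
    final String name;
    final List<DsCollection> collections = new ArrayList<>();

    Community(String name) {
        this.name = name;
    }

    // Number of files held by all items in all collections of this community.
    int fileCount() {
        int n = 0;
        for (DsCollection c : collections) {
            for (Item i : c.items) {
                n += i.fileCount();
            }
        }
        return n;
    }
}
```

The sketch has no place for sub-collections or for relations between items, which reflects the limitation of this strictly hierarchical model for a collection like the Folklore one.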

DSpace can be used “out of the box” for the generation of a digital repository in 
the case that content consists of independent digital documents. It provides simple, 
usable and effective solutions to common problems, such as user and workflow
management, persistence and indexing/searching. However, its aims do not include 
digitized content, interrelation schemes and custom metadata in a fine-grained manner 
(per collection or sub-collection, for instance). Its customization capabilities also refer 
to this context and mainly are related to: 

■ user interface arrangements 

■ installation wide modification of the qualified DC metadata element set 

■ custom workflow steps setup 

Although DSpace provides several built-in facilities that simplify and speed up the
development of a digital repository, these features are highly coupled with each other
and, mainly, coupled to the underlying database schema. For example, DSpace
natively supports collections and the database schema reflects this by providing a
distinct Collection table, holding collection related information. Enriching this
information by adding more table fields is possible and straightforward.
The same holds for analogous issues, such as metadata support, sub-collections
or relations, since all could potentially be supported by performing the necessary
database schema modifications. Nevertheless, two important issues arise when performing
such modifications:

■ Changes should be also made to the system “core” components. In order to perform 
significant modifications to its functionality, changes should be applied to both the 
database schema and relevant code. 

■ These changes, once made, break compatibility with future releases and the rest of 
DSpace installations, limiting the ability to benefit from future improvements, addi- 
tions and extensions. 

In simple cases, such as sub-collection support, the modifications or extensions
required can be identified, designed based on the current DSpace status and implemented in
an adequately satisfying manner. In the case of more complex features, such as
advanced collection management (i.e. support of heterogeneous collections with
disparate metadata sets), the required modifications may become extremely complicated.
The point is whether the current architecture of DSpace will stand in the way of the
development of such features, which were not included in its initial design and development. Its
database orientation indicates that this will be the case, effectively rendering its
current form and design obsolete and requiring a re-implementation from the beginning.

4.3 DL System Selection 

Based on the aforementioned remarks, LCC has chosen Fedora for the
development of the Athens University Digital Library. Its XML-based digital object model
provides the ability to support composite objects and preserve heterogeneous
metadata of different categories (descriptive, technical, administrative or structural
metadata). Furthermore, by using behaviors on each digital object it is feasible to
dynamically implement collection-specific relations, in a unified and system-oriented
fashion. Fedora can also effectively support methods to create, edit or present digital objects,
based on collection-specific templates. Collections can be represented as common
digital objects, providing an elegant unified representation and management scheme
at the programming level.

Finally, Fedora’s extensible architecture provides the ability to develop additional 
external modules, which do not interfere with its core but communicate with it
through its public APIs. This characteristic enables the development of a DL system 
on top of Fedora, which can operate in an independent manner regarding its business 
logic, custom metadata elements and semantics, while preserving the ability to benefit 
from new versions of the “Fedora core”. 



5 Extending Fedora to Support Folklore Collection 

Based on the experiences gained from the prototype collection implementations, a set 
of modeling customizations and system extensions were considered, in order to sup- 
port Folklore Collection advanced features. The proposed architecture (presented in 
Figure 1) may be used to host other collections with similar features as well. The 
Object management and Collection management modules are used to reflect the ad- 
vanced object and collection management requirements imposed by the Folklore Col- 
lection. An external module is used to extend the indexing and searching capabilities 
of the Fedora core system over additional metadata sets. These three modules utilize 
the Fedora APIs, so there is no need to interfere with the internal Fedora Repository
System. They act as an intermediate level between the applications that will be
developed (providing administrator, cataloguer and user presentation services) and the
underlying Fedora system.

The main modeling conventions adopted for Folklore Collection special features 
are discussed below. The main modules extending Fedora’s functionality are also 
described. 

Collection and Sub-collection Definition and Description 

Each collection object is represented as a common Fedora digital object, containing a 
datastream holding collection description and rights metadata and a datastream hold- 
ing the templates of the common data objects that will be created in the context of this 
collection. All datastreams are implemented as inline XML content. 

Fig. 1. The proposed integrated DL architecture



The main reason for defining templates per collection is to provide guidelines for 
the creation and management of all collection data objects in a unified manner. A 
different template is used for each distinct type of data object (e.g. notebook,
photograph, map). This template contains the definition of metadata fields for the data
object and their attributes (repeatable, mandatory, indexed, etc.), the description of files
that represent digital content, the permitted relations, and the behaviors used to
disseminate content and metadata.

An example of a digital object template from the Folklore Photographs sub- 
collection is presented in Figure 2. The template for the “photograph” data object type 
contains the following tags: 

• <field>: defines the fields of the metadata set used by the data object type. When a
data object is created, all these metadata fields are inserted in a datastream.

• <file>: defines the necessary file formats for the data object. Each file is associated
with a datastream.

• <relation>: defines the permitted relations in which the data object is able to
participate. In the data object, the relations will be stored in a separate datastream.

• <disseminator>: defines the disseminators supported by this object type. A
disseminator associates a behavior definition object (bdef) and a behavior mechanism
object (bmech) with a specific datastream of the data object.

An external module manages collections using Fedora’s management API (API-M). 
The main role of this module is to create collections and sub-collections, edit collec- 
tion description metadata and import templates for the creation of data objects. When 
creating a sub-collection, the description and rights metadata are copied from the
parent to the child collection. The persistent id of the parent collection is recorded in the
content model type identifier (contentModelID) in the data object’s system metadata.

<dobj type="photo">
  <field>
    <name>dc:title</name>
    <indexed>true</indexed>
    <mandatory>true</mandatory>
    <repeatable>false</repeatable>
    <viewable>true</viewable>
    <label lang="eng">Photo title</label>
  </field>
  <field>
    <name>dc:subject</name>
    <indexed>true</indexed>
    <mandatory>false</mandatory>
    <repeatable>true</repeatable>
    <viewable>true</viewable>
    <label lang="eng">Subject</label>
  </field>
  <file>
    <name>PHOTOHQ</name>
    <type>image/tiff</type>
    <label lang="eng">High quality photo</label>
  </file>

Fig. 2. Data object template from Folklore Photographs sub-collection 
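
As an illustration of how a module might consume such a template, the following is a minimal sketch in Python. It is not taken from the system described here: the function names `parse_template` and `_flag` are hypothetical, and the embedded sample is an abridged copy of the template elements shown in Figure 2.

```python
import xml.etree.ElementTree as ET

# Abridged sample modelled on the template of Figure 2.
TEMPLATE_XML = """
<dobj type="photo">
  <field>
    <name>dc:title</name>
    <indexed>true</indexed>
    <mandatory>true</mandatory>
    <repeatable>false</repeatable>
    <viewable>true</viewable>
    <label lang="eng">Photo title</label>
  </field>
  <field>
    <name>dc:subject</name>
    <indexed>true</indexed>
    <mandatory>false</mandatory>
    <repeatable>true</repeatable>
    <viewable>true</viewable>
    <label lang="eng">Subject</label>
  </field>
  <file>
    <name>PHOTOHQ</name>
    <type>image/tiff</type>
    <label lang="eng">High quality photo</label>
  </file>
</dobj>
"""


def _flag(element, tag):
    """Read a true/false child element as a Python bool (hypothetical helper)."""
    return (element.findtext(tag) or "false").strip().lower() == "true"


def parse_template(xml_text):
    """Return the object type, the field rules and the file definitions."""
    root = ET.fromstring(xml_text)
    fields = {}
    for field in root.findall("field"):
        name = field.findtext("name").strip()
        fields[name] = {
            "indexed": _flag(field, "indexed"),
            "mandatory": _flag(field, "mandatory"),
            "repeatable": _flag(field, "repeatable"),
            "viewable": _flag(field, "viewable"),
        }
    files = {
        f.findtext("name").strip(): f.findtext("type").strip()
        for f in root.findall("file")
    }
    return root.get("type"), fields, files


obj_type, fields, files = parse_template(TEMPLATE_XML)
```

Keeping the per-field attributes in a plain dictionary makes it straightforward for other modules to consult them when creating or indexing data objects.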

Composite Objects 

The folklore notebook is represented as a set of data objects that belong to three
different types: main, chapter and page. Every type corresponds to a data object
template, which is defined on the appropriate collection. The type is defined in the
contentModelID field of the digital object in the form collection_pid.object_type (e.g.
uoadl:10.chapter) and points to the specific data object template. Main and
chapter objects contain descriptive metadata specific to the notebook and the chapter 
respectively, together with structural metadata. Page objects contain the digital con- 
tent (the page image in different formats) without descriptive metadata, together with 
structural metadata. Descriptive metadata, structural metadata and digital content are 
implemented in Fedora as separate datastreams of a data object. 
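
The identifier convention described above can be sketched in a few lines of Python. The function name `split_content_model_id` is hypothetical, and splitting at the last dot is an assumption (it presumes that object type names contain no dots).

```python
def split_content_model_id(content_model_id):
    """Split 'collection_pid.object_type' into (collection_pid, object_type).

    Assumption: the object type never contains a dot, so the split is
    made at the last dot of the identifier.
    """
    collection_pid, sep, object_type = content_model_id.rpartition(".")
    if not sep or not collection_pid or not object_type:
        raise ValueError(
            f"not of the form collection_pid.object_type: {content_model_id!r}"
        )
    return collection_pid, object_type


parts = split_content_model_id("uoadl:10.chapter")
```

With the example identifier from the text, the collection pid `uoadl:10` identifies the collection holding the template and `chapter` selects the template itself.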

The objects management module creates data objects of specific types based on the 
templates defined on the collection. The basic methods provided are: retrieve template 
guidelines, edit metadata, add files, create disseminators and relate data objects using 
the appropriate relations. All these actions are constrained by the guidelines given by
the data object template.
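
One way such template constraints might be enforced when metadata are edited is sketched below. This is an assumption-laden illustration rather than the module's actual logic: `validate_metadata` and the `RULES` dictionary are hypothetical, and only the mandatory and repeatable attributes are checked.

```python
def validate_metadata(field_rules, metadata):
    """Check metadata values against per-field template rules.

    field_rules: {field name: {"mandatory": bool, "repeatable": bool}}
    metadata:    {field name: list of values}
    Returns a list of problems; the list is empty when the metadata conform.
    """
    problems = []
    for name, values in metadata.items():
        if name not in field_rules:
            problems.append(f"{name}: not defined in template")
        elif len(values) > 1 and not field_rules[name]["repeatable"]:
            problems.append(f"{name}: not repeatable")
    for name, rule in field_rules.items():
        if rule["mandatory"] and not metadata.get(name):
            problems.append(f"{name}: mandatory field missing")
    return problems


# Rules corresponding to the two fields of the photograph template.
RULES = {
    "dc:title": {"mandatory": True, "repeatable": False},
    "dc:subject": {"mandatory": False, "repeatable": True},
}

ok = validate_metadata(RULES, {"dc:title": ["Wedding"], "dc:subject": ["a", "b"]})
bad = validate_metadata(RULES, {"dc:subject": ["a"], "dc:creator": ["x"]})
```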

Digital Object Relations 

Relations are necessary to represent the structure of composite objects or to relate 
independent data objects belonging to the same or different collections. To implement 
relations, a special datastream is used on every data object. This datastream contains 
structure metadata, of the form: 

<relation type='relation_type'>pid</relation>

The value ‘relation_type’ denotes the type of the relation between the current data
object and the one with the specified pid. The permitted relations for a data object are
specified at collection level. The meaning of every relation is defined by the data
object’s disseminators. The relation types for the notebooks collection digital objects
are: ‘previous’ and ‘next’ to navigate between pages, ‘chapter’ and ‘main’ to define
current data object’s chapter and main object, and ‘photo’ to retrieve the photograph 
attached to the current page object. To extend navigation functionality in a notebook,
a table of contents is used in each main digital object. This table is generated from the
structural metadata of every data object connected to the specified notebook by the
‘main’ relation. The table of contents is represented in main digital objects as an XML
content datastream.

  <file>
    <name>THUMB</name>
    <type>image/jpeg</type>
    <label lang="eng">Photo thumbnail</label>
  </file>
  <relation>
    <name>page</name>
    <label lang="eng">Notebook page</label>
    <target_type>uoadl:10.page</target_type>
  </relation>
  <disseminator>
    <name>goToPage</name>
    <label lang="eng">View notebook page</label>
    <datastream>STRUCT</datastream>
    <bdef>uoadl:20</bdef>
    <bmech>uoadl:21</bmech>
  </disseminator>
</dobj>

Fig. 2 (continued). Data object template from Folklore Photographs sub-collection

Although standard relation metadata can be used, such as Dublin Core relation 
element refinements (e.g. isParentOf, isReference, etc.), a local metadata set was used,
since it is more helpful to manage the structural metadata in a separate datastream, 
and more flexible to define specific relations depending on the collection needs, than 
using standard relation types. For example, the ‘next’ and ‘previous’ types have a 
special meaning in the notebook collection, helping the user to navigate between 
pages. In order to support interoperability with other digital repositories, we use a 
mapping of these metadata to DC element refinements. In order to facilitate the de- 
fined relations in the presentation level, specific web services are implemented and 
associated with behavior objects. Thus, the end-user service communicates directly with
the Fedora Access API in order to retrieve and use the appropriate behavior methods.
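
The relation datastream and the navigation it supports can be illustrated with a short Python sketch. Everything beyond the `<relation type='...'>pid</relation>` form is an assumption: the `<struct>` wrapper element, the sample pids, and the helper names `parse_relations`, `members_of` and `page_order` are hypothetical, and the datastreams are held in a dictionary rather than fetched through the Fedora APIs.

```python
import xml.etree.ElementTree as ET


def parse_relations(struct_xml):
    """Map each relation type to the list of target pids."""
    relations = {}
    for rel in ET.fromstring(struct_xml).findall("relation"):
        relations.setdefault(rel.get("type"), []).append(rel.text.strip())
    return relations


# Hypothetical structural datastreams keyed by pid (the wrapper is assumed).
STRUCT = {
    "uoadl:101": "<struct><relation type='main'>uoadl:100</relation>"
                 "<relation type='next'>uoadl:102</relation></struct>",
    "uoadl:102": "<struct><relation type='main'>uoadl:100</relation>"
                 "<relation type='previous'>uoadl:101</relation></struct>",
    "uoadl:200": "<struct><relation type='main'>uoadl:999</relation></struct>",
}


def members_of(main_pid, struct_by_pid):
    """Pids whose 'main' relation points at main_pid: raw material for a
    table of contents."""
    return sorted(
        pid for pid, xml in struct_by_pid.items()
        if main_pid in parse_relations(xml).get("main", [])
    )


def page_order(first_pid, struct_by_pid):
    """Follow 'next' relations starting from the first page."""
    order, seen, pid = [], set(), first_pid
    while pid in struct_by_pid and pid not in seen:
        order.append(pid)
        seen.add(pid)
        following = parse_relations(struct_by_pid[pid]).get("next", [])
        pid = following[0] if following else None
    return order


toc_members = members_of("uoadl:100", STRUCT)
pages = page_order("uoadl:101", STRUCT)
```

The `seen` set guards against cycles in the ‘next’ chain, which a real repository could contain if relations were edited inconsistently.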

Extended Metadata Support 

Every metadata model can be described in one or more datastreams of the digital 
object. Fedora supports indexing and searching only for the Dublin Core metadata
element set (‘DC’ datastream) and the digital object’s system metadata. An external
indexing application supports indexing and searching of other metadata sets using
Fedora’s management API. Separate indexes must be created for every collection.
Two open-source indexing applications that may be suitable for this purpose are
Jakarta Lucene [9] and Apache Xindice [1].
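
The idea of one index per collection, restricted to the fields a template marks as indexed, can be illustrated with a toy inverted index in Python. This is only a sketch of the concept: it does not use Lucene or Xindice, the class `CollectionIndex` is hypothetical, and the metadata are supplied directly rather than retrieved through Fedora.

```python
from collections import defaultdict


class CollectionIndex:
    """Toy inverted index over the metadata fields marked as indexed."""

    def __init__(self, indexed_fields):
        self.indexed_fields = set(indexed_fields)
        self.postings = defaultdict(set)  # (field, term) -> set of pids

    def add(self, pid, metadata):
        """Index the values of every indexed field of one data object."""
        for field, values in metadata.items():
            if field not in self.indexed_fields:
                continue  # fields not flagged as indexed are ignored
            for value in values:
                for term in value.lower().split():
                    self.postings[(field, term)].add(pid)

    def search(self, field, term):
        """Return the pids whose given field contains the term."""
        return sorted(self.postings.get((field, term.lower()), set()))


# One separate index per collection, keyed by the collection pid.
indexes = {"uoadl:10": CollectionIndex(["dc:title", "dc:subject"])}
indexes["uoadl:10"].add(
    "uoadl:301", {"dc:title": ["Village wedding"], "note": ["wedding draft"]}
)
indexes["uoadl:10"].add(
    "uoadl:302", {"dc:title": ["Harvest"], "dc:subject": ["Wedding customs"]}
)

title_hits = indexes["uoadl:10"].search("dc:title", "wedding")
subject_hits = indexes["uoadl:10"].search("dc:subject", "Wedding")
```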



6 Summary 

Athens University must support an integrated Digital Library framework for multiple 
heterogeneous digital collection development. The most important requirements im- 
posed by the University’s collections were discussed in this paper, along with the 
description of the Folklore Collection. In order to evaluate available Digital Library 
systems, the Folklore Collection was chosen as a guide. A prototype implementation
of the Folklore digital collection was developed using Fedora and DSpace. Based on the
comparison results, it was decided to develop the integrated Digital Library platform
on Fedora, accompanied by a proposed extension that better fits Athens University’s
special needs. While DSpace provides an enhanced “out of the box” solution to
develop institutional collections that include digital material, Fedora’s architecture is 
preferred due to its extensibility. 

Abstract. The DSpace™ digital repository system was released as open source
software in November of 2002. In the year since then it has been adopted by a 
large number of research universities and other organizations world-wide that 
need a digital repository solution for a number of content types: research arti- 
cles, gray literature, e-theses, cultural materials, scientific datasets, institutional 
records, educational materials, and more. The DSpace platform and its various 
applications are becoming better understood with experience and time. As one 
result of a recent meeting of the DSpace user community, we are now venturing 
into the territory of broad, community-based open source development and 
management, and gaining insights from the experience of the Apache Founda- 
tion, Global Grid Forum, and other successful open source projects about how 
to build open source software for the digital library domain. 



Introduction 

DSpace™ is a free, open source software platform for building repositories of digital 
assets, with a focus on simple access to these assets, as well as their long-term
preservation (to help ensure access over very long time frames) [1]. It was originally
designed with a particular service model in mind: that of institutional repositories of
research material, and particularly research articles, which are produced by academic 
research institutions [2]. The idea was that institutions of all kinds could and should 
accept stewardship responsibility for their intellectual research output, for its wide- 
spread and long-term access. This is related to, but not synonymous with, the Open 
Access movement¹, since while many of the institutions using DSpace have free
access to their assets as a goal, the platform itself does not assume that assets it stores
will be made available for free. 

DSpace was originally designed by developers at the MIT Libraries and HP Labs 
to be a breadth-first system with functionality to capture, describe, store, and preserve 
digital content, which adopters could download and install with minimal configura- 
tion and customization [3]. This decision was made for two reasons: to test the value 
of archivally-oriented digital asset management systems to the research university 
community without the need for extensive technical development, and to get a system 
out to the open source development community that was “good enough” to get things
going and foster wider debate about the many technology choices involved. 

Since its launch as an open source project in November of 2002, DSpace has un- 
dergone widespread adoption in several communities, and is starting to undergo ac- 
tive development by an open source developer community. This process of going 
from research to a public production release 1.0 on SourceForge, and then to a plat- 
form that is being developed by a large group of software developers representing 
both the original target audience and others who were not foreseen is an interesting 
story. It is our belief that academic research communities, which often create open
source projects for very good reasons, do not necessarily understand the implications of
the open source model or the long-term issues it raises. Our experience with DSpace 
is both atypical of many of the successful open source projects and also instructive to 
other research projects with a goal of becoming successful open source projects as 
their long-term business plan. 

As a research project, it was the goal of neither the MIT Libraries nor HP to
productize DSpace or to continue to provide sole support and development of the platform
going forward. Both organizations continue to work on the platform, in different areas 
and for different reasons, and we are committed to making sure that the platform has a 
viable and sustainable model for its ongoing development and adoption. That means 
ceding a large degree of control in order to gain the long-term vision of a self- 
sustained tool that we can all leverage to our best advantage. 

This article attempts to provide enough context for DSpace to explain its origin and 
goals, to report on what has happened during the first year of its life as an open source 
project, and to attempt to divine the future of its transition to the next phase. 



Background 

The DSpace project was born out of a need voiced by faculty to the MIT Libraries to 
create a scalable digital archive that preserves and communicates the intellectual out- 
put of MIT’s faculty and researchers. At the Institute, there is a growing body of digi- 
tally born materials representing significant intellectual assets that require steward- 
ship. In addition to the more traditional text-based research output such as preprints 
and working papers, these assets include audio files, videos, datasets, software simu- 
lations and more. Faculty members often post their work on personal or departmental 
websites, but increasingly have become concerned about the sustainability of that 
solution. DSpace offers faculty and researchers a professionally managed archive that 
allows easy accessibility to their scholarly work. 

Recognizing that the problems DSpace seeks to address are not unique to MIT, the 
MIT Libraries and HP Labs envisioned a federated repository based on a common set 
of institutional repository standards for interoperability. Interoperability would make 
available the collective intellectual resources of the world’s leading research institu- 
tions. Further, we opted from the beginning to make the software entirely open source 
with the hope that a community of users and developers would emerge beyond the 
original MIT and HP team to contribute to the maintenance and enhancement of the 
code base over the long term. 

The DSpace Federation

In January 2003, the MIT Libraries embarked on a project funded by the Andrew W. 
Mellon Foundation to work with seven other research universities to begin the process 
of building a collaborative federation of institutions running DSpace. Each of the 
seven universities installed DSpace and tested the adaptability of the system to their 
university environment. Our goal was to learn from these implementations and to 
share lessons learned with a wider DSpace Federation. 

From the time the system was released as open source in November of 2002, up- 
take of the system has extended well beyond the original Federation project partners. 
DSpace has been adopted by a large number of research universities and other organi- 
zations that need a digital repository solution for a number of content types. These 
universities have evaluated DSpace’s functionality and are further developing it to
meet their needs. At the moment the software has been downloaded nearly 10,000
times; over 125 universities are investigating it for use in their university environ- 
ment; and at least 20 universities are running production DSpace systems. 

With interest in and use of DSpace mounting far more quickly than was originally 
anticipated, the set of institutions participating in the DSpace Federation project made 
the strategic decision to expand the final project meeting to include all institutions 
currently using DSpace and shift the purpose of the gathering to an open user group 
meeting, which was held on March 10-11, 2004. Approximately 120 people attended 
the sold-out meeting, representing 50 institutions, including universities, government 
agencies, and corporations, from 10 different countries. Members of the user commu- 
nity shared their DSpace experiences and plans, through which we learned that the 
DSpace platform is being put to a variety of uses: primarily to create institutional 
repositories of research publications and other material, but also for other applications 
(e-thesis repositories, learning object repositories, e-journal publishing, cultural mate- 
rial collections, electronic records management, and so on). 

Within the UK, we already are beginning to see the diversity of purposes to which 
DSpace can be applied. The DSpace@Cambridge project, a collaboration between
Cambridge University Library and MIT Libraries, aims to implement an institutional
repository for scholarly research, but also is exploring the use of DSpace for
administrative records and learning objects. Edinburgh University chose the DSpace 
platform for its Theses Alive! project, which aims to produce an OAI-compliant re- 
pository for the creation and management of e-theses and pilot it as a national service. 
Programmers at Edinburgh have developed an add-on module for DSpace that in- 
cludes a supervised workspace for theses creation, supervision administrative tools, 
and a submission system for theses metadata collection. Glasgow’s DAEDALUS 
project is piloting several open source institutional repository solutions and has opted 
to deliver a range of distinct open access services supported by complementary soft- 
ware platforms (one of which is DSpace) that optimally meet Glasgow University’s 
needs for specific collection and digital content types. For the international
community, it is also relevant to note the work done by the Université de Montréal’s Érudit
project to translate DSpace into French, work that has provided important lessons for 
customizing DSpace for local language. Other institutions in non-English speaking 
countries are now working to translate the system into local languages, and have iden- 
tified general internationalization as an important goal for the future. 

DSpace, and institutional repositories in general, are proving to be a high-value,
long-term vision, but are still very much works in progress. Universities are setting 
their own policies to define what an institutional repository service means in the con- 
text of their university environment. Seemingly straightforward questions such as 
what types of file formats or content will be accepted and who is authorized to submit 
materials to DSpace quickly become complex when long-term implications for digital 
preservation and stewardship are considered. 

Building collections of digital content, particularly scholarly research content, has 
proven to be another challenge universities consistently grapple with when imple- 
menting institutional repositories. Many of the DSpace projects around the world are 
grant funded or have limited resources and are under pressure to prove the value of 
the service, often measured (rather simplistically) through the number of items in the 
repository. DSpace was designed with a decentralized web submission interface that 
allows research communities to contribute their own items and metadata. This para- 
digm shift has been a novel and attractive aspect of the service but has meant that 
library staff have had to become proficient marketers, carefully positioning the service
to meet user needs. Publicity and promotional activities help raise initial awareness 
among potential users but targeted communications with highly tailored marketing 
messages often are what persuade them to become submitters. 



Open Platform - First Steps 

The DSpace software released as version 1.0 into open source embodied use-cases 
derived from an analysis of needs within the MIT scholarly community viewed 
through the lens of the library. Yet this raised an important question: to what extent
did these use-cases reflect the needs of institutional repositories generally? Rather 
than undertake a systematic survey or study, the expectation was that those who 
evaluated or adopted the software would provide an answer in the form of reworking 
the software itself to suit local purpose. The evolution of the DSpace platform would 
then consist of a rational assimilation of this work into the centrally managed code 
repository. Our biggest concern was the possibility of fragmentation or ‘centrifugal’ 
dissipation: that the platform would be pulled in too many directions, asked to do too 
many things, so that none could be done well. To prevent this, procedures were insti- 
tuted to subject proposed contributions to a closely managed review process. Those of 
sufficient technical merit and deemed consistent with the vision of DSpace would be 
incorporated; the rest would reside as localizations of the platform outside its man- 
agement. 

The first year produced relatively few contributions, given the size and interest 
level of the adopter community. This was not due to a shortage of ideas, however: the 
mail lists and other forums were filled with use-cases and other expressions of need 
exceeding the 1.0 platform capability. Analysis of this situation revealed several
factors at work: (1) The process of adopting DSpace could be lengthy and involved, and
technical rework was often put behind such tasks as formulating a sustainable busi- 
ness model, developing service guidelines, or building awareness and buy-in from 
depositors. This had the effect of pushing software development considerations out of 
an early time frame. (2) Many of the potential adopter institutions lacked the technical 
resources required to undertake significant software development. (3) Architectural 
limitations in the implementation of the platform made certain kinds of modification
difficult to do. (4) Perhaps most interesting, however, was the perception that the
platform, although distributed freely in source code form, was an immutable offering,
much like a commercial software product. There are many reasons why this
perception took root, including the fact that its initial development cycle was ‘closed’, 
and that in order to build awareness of the platform it was ‘branded’ as an MIT/HP- 
sponsored effort, rather than an outgrowth of a community-driven process. 

To address this perception, DSpace development was deliberately steered in the di- 
rection of the needs of the nascent community of users. The functional requirements 
of the next major release of the platform, 1.2 (1.1 essentially represented the
completion of the original research project agenda), were culled from postings to the DSpace
lists, and from other discussions and surveys eliciting adopter feedback. In this way 
we hoped both to realize and to convey the community-centric nature of the DSpace 
platform. And to the degree that this additional functionality will remove barriers to 
adoption, the plan is proving successful. Yet since the bulk of the development effort 
was still concentrated within MIT/HP, it also is having the opposite effect - that of 
reinforcing the vendor/consumer dichotomy it was intended to overcome. 

On the technical architecture front, the analysis of limitations has produced a 
roadmap for a new design direction, DSpace 2.0, which will address several key 
shortcomings of the current architecture: (1) Functional modularity, coupled with
stable, well-defined APIs, will promote the development of independent
implementations by DSpace adopters. This will substantially alter the concept
of DSpace as a closed body of code, replacing it with the concept of a software 
framework, within which myriad implementations may coexist. (2) A refactoring of 
the presentation layer will enable much simpler alteration of UI without complications 
elsewhere in the code. (3) A much cleaner representation of content and associated 
metadata as a self-contained archival information package (AIP) will facilitate inter- 
operability and maintenance of a DSpace repository. 



From Code to Community 

One important lesson we learned was this: to build an open source community, it is 
insufficient merely to publish a body of code as open source, even on
commercially-friendly licensing terms (BSD [4]), and wait for a community to coalesce. Achieving
true community requires the transformation of users who are initially consumers into 
stakeholders. We are examining several successful open source initiatives, such as the 
Apache Software Foundation, the Global Grid Forum, and the Eclipse Foundation, 
and, together with the user community, are formulating a plan for the DSpace
platform. Among short-term objectives are: (1) Expansion of the core set of developers to
include those outside the initial circle of researchers. (2) Articulation of a clear proc- 
ess to encourage further enlargement of the developers’ group. In most open source 
models, the existing group invites new developers, and functions as a project man- 
agement board. (3) Recruitment of contributors to the platform on many other levels, 
from requirements definition to documentation, testing - indeed all aspects of plat- 
form maintenance and evolution. (4) Improved communication channels. Two goals 
are involved here: first, to produce greater transparency in the process of platform 
development we will need better ways to expose the deliberative steps involved. A 
developer-focused mailing list is one frequently adopted technique to achieve this. 
Second, there need to be more flexible and accessible opportunities to become
involved in development issues. Wikis and other semi-structured discussion tools can
serve this purpose. 

In the longer term, it will be important to establish or join forces with an independ- 
ent not-for-profit entity (e.g. a 501(c)(3) corporation [5]) to be charged with steward- 
ship of the software, and to possibly assume ownership of the intellectual property 
(copyright, trademark, license, etc.). Issues of financial support and governance mod- 
els will be foremost in choosing a model - e.g. does financial contribution confer 
special privileges with respect to platform development? We hope to address these 
issues carefully, while proceeding quickly on the short-term agenda. Throughout this 
process, what is paramount to communicate to the greater body of adopters is that the 
continued evolution and in fact the very existence of the platform will depend upon a 
collective effort, not on the beneficence of the founding institutions. 



Conclusion 

As stated earlier, while it is not the current aim of the MIT Libraries or HP to build a 
commercial product with DSpace, neither was it our aim to prevent that from happen- 
ing at all. We wanted to understand what it would take to build a useful digital archi- 
val repository: to test the technologies involved, to have a platform to explore service 
models like institutional repositories, and to have a platform for ongoing research in 
important areas such as digital preservation, Semantic Web techniques for metadata 
management, persistent identification schemes, and open access-friendly DRM sys- 
tems. 

In order to achieve our goals for the DSpace platform it is vital that it become a 
successful open source project with an active community of developers far beyond 
MIT or HP. That can only happen if the platform is useful to a critical mass of organi- 
zations that can provide the resources to do this work. We also expect that develop- 
ment of the platform will reveal a range of necessary standards - for interoperability, 
for rights management, for identification of content and the people accessing it, for
content discovery and preservation, for the metadata to support all of this, and more. 
The future DSpace Federation organizational home will provide the governance to 
make sure that everyone’s goals for the platform are met, and hopefully to foster its 
adoption by a range of organizations in many sectors. The research community that
we represent is but one potential adopter of this technology, and we believe that by
leveraging the expertise and resources of other sectors, ours will ultimately benefit in 
ways that have proved elusive in the past. 

The promise of open source, through projects like DSpace, to the digital library
community is obvious, if it is successful. But there are some barriers to success. Many
institutions lack the resources to deal with complex applications like DSpace on a
technical level - they require support to install, configure and customize it for their local
needs, and to maintain it over time. The open source world, with a few exceptions 
(most notably Red Hat for the LINUX operating system), doesn’t provide models for 
such support and assumes that adopters have the necessary local expertise. The sec- 
ond barrier to success is in sustaining the developer community that will ensure the 
platform’s continued usefulness over time. The research library community, which has
been the primary adopter of the DSpace platform so far, does not, by itself, have
the resources to sustain DSpace indefinitely. Its members have technical expertise,
to be sure, but it is typically over-stretched. They often cannot dedicate
programmers to work on an open source platform without external support (usually for a
grant-funded project of a year or two). It is possible that DSpace could survive on that 
basis, but risky. If the digital library community can share control of the platform with 
other sectors, particularly commercial and governmental sectors, then many more 
resources can be brought to bear on the problem.

So why are research libraries motivated to get involved in open source projects like 
DSpace? To learn more about how it really works, to better fulfill their mission, be- 
cause commercial offerings are too expensive and often inadequate. Why are research 
libraries not getting involved? There is a noticeable tendency among managers to treat 
open source software as if it were commercially supplied. The owning organization is
the “vendor” and the “products” can be comparatively evaluated and judged good or
bad accordingly. The problem with this approach is that where open source software 
is concerned we are all, collectively, “the vendor”. Or rather, there is no vendor to 
negotiate with, and if the product doesn’t meet local needs then it can be made to do 
so. There is a corresponding tendency among library adopters of open source software 
to feel faint obligation back to its source - the software is just a product that happens 
to be free. But open source software certainly does cost its adopters something: the 
staff time to configure and maintain it without a formal support contract (typically), 
and the more nebulous moral obligation to provide some value in return for this free 
good. If open source software works at all then it’s because those who benefit from it 
also contribute to it in some way: functionally, technically, or monetarily. Our
community has much to learn about ways in which it can contribute to these efforts other
than as grateful, but silent, adopters. 

We have looked for inspiration to existing open source projects and organizations: 
obviously LINUX and the Apache Foundation, but also the Global Grid Forum and 
CNRI. Each of these organizations has some model for sustaining open source soft- 
ware but they’re all different. Undoubtedly there are many others that we have not yet 
had time to identify and investigate. Which one is the most relevant to applications 
like DSpace? Which to the communities that created it? Which to the communities 
who are now adopting and improving it? Clearly there are many, many issues still to 
be addressed, and we hope that our experience in some way informs the understand- 
ing of the open source promise to and contract with the digital library community. 

Abstract. This paper presents results from an evaluation of algorithms
for ranking results by probability of relevance for Geographic Informa- 
tion Retrieval (GIR) applications. We review the work done on GIR and 
especially on ranking algorithms for GIR. We evaluate these algorithms 
using a test collection of 2500 metadata records from a geographic digital 
library. We present an algorithm for GIR ranking based on logistic re- 
gression from samples of the test collection. We also examine the effects 
of different representations of the geographic regions being searched, in- 
cluding minimum bounding rectangles, and convex hulls. 



1 Introduction 

Geographic data are an extremely important resource for a wide range of sci- 
entists, planners, policy makers, and analysts who study natural and planned 
environments. Notably, the landscape of geographic analysis has been changing 
rapidly from data and computation poor to data and computation rich [15]. De- 
velopments in digital electronic technologies, such as satellites, integrated GPS 
units, digital cameras, and miniature sensors, are dramatically increasing the 
types and amounts of digitally available raw geographic data and derived in- 
formation products [17]. At the same time, advances in computer hardware, 
software and network technologies continue to improve our ability to store and 
analyze these large, complex data sets. 

These factors are contributing to a growing political, social, scientific and 
economic awareness of the value of geographic information and driving new ap- 
plications for its use. In response to this, geographic digital libraries are growing 
in number, collection size, and sophistication. Moreover, mainstream digital li- 
braries, i.e. those that deal with primarily text materials, are increasingly consid- 
ering geographic access methods for information resources that have important 
geographic characteristics. Simply stated, most of the objects in digital libraries 
are, to a greater or lesser extent, about, or related to, particular places on or 
near the surface of the Earth. 

One common approach in digital libraries is to use place names as a geographical 
search surrogate. However, place names have well-documented lexical and 
geographical problems [13]. Lexical problems include lack of uniqueness, variant 
names or spellings, and name changes. Geographical problems include bound- 
aries that change over time and geographic features or areas without known place 
names. Geographic coordinates, on the other hand, provide an unambiguous and 
persistent method for locating geographic areas or features. However, the use of 
coordinates presents many challenges in terms of storage, indexing, processing 
and user interface design that only recently have begun to be investigated in the 
context of geographic information retrieval (GIR) for digital libraries. 

One key question for GIR is what level of detail should be used to encode co- 
ordinate information? Gazetteer research cautions that complex spatial objects 
present enormous data storage and performance problems for online geographic 
digital libraries [1,11], which provide, at best, extremely limited GIS function- 
ality. The decomposition of complex spatial objects into approximate represen- 
tations is a common approach to simplifying coordinate representations. Early 
work in this area by Hill [10] suggests that minimum bounding rectangles can 
sufficiently represent geographic objects for information retrieval applications. 
Other research [1, 12] indicates that even single point representations can be 
used effectively when combined with innovative retrieval and ranking methods. 

In this paper we explore these issues and present some new algorithms for 
ranked retrieval of georeferenced objects in digital library collections. We discuss 
the characteristics of georeferenced information and its use in digital libraries. 
The next section describes the primary components for GIR within digital li- 
braries and describes the characteristics of GIR in a digital library context. 
Subsequent sections examine indexing and access creation for geo-referenced 
sources. We then examine the retrieval effectiveness of several GIR algorithms 
using a test collection of geospatial metadata from the California Environmental 
Information Catalog (CEIC http://ceres.ca.gov/catalog). 



2 Geospatial Metadata 

Geographic digital libraries typically use geospatial metadata to provide surro- 
gate representations of geographic resources that encode the structure and con- 
tent of digital geographic data to support identification, discovery, evaluation, 
and understanding. This metadata is vital for most geographic data because, 
as non-textual, abstract representations of complex phenomena, they cannot be 
effectively and appropriately used without it. 

Geospatial metadata specifically addresses the encoding of coordinate 
representations of geographic objects. There are geospatial metadata standards in 
most EU countries. In the U.S., geospatial metadata is usually created in 
accordance with one of two metadata standards: 1) the Dublin Core (DC) [6]; or 
2) the Federal Geographic Data Committee’s Content Standard for Digital 
Geospatial Metadata (FGDC) [8]. The only geographic element in the base DC is 
the Coverage element. This element can be used to specify a place name, place 
code (e.g. zip code), or the geospatial coordinates of a point, bounding 
rectangle, or irregular polygon that locates the resource being described. 

Fig. 1. Minimum Bounding Rectangles (thin lines) and Convex Hulls (heavy lines) 
for the State of California and the City of San Jose. 

The FGDC Standard was created specifically to describe digital geospatial 
data, but is also applied to paper maps, air photos, atlases, environmental im- 
pact statements, and other geographically related materials. It provides elements 
that address the geospatial characteristics of the data, including: Spatial domain 
(geographic coordinates defining the data’s extent), Place names (qualitative 
descriptors of the geographic extent), Spatial reference system (projection and 
coordinate system information), Spatial representation model (vector, raster), 
Spatial features (type and quantity) and Spatial data quality (accuracy, com- 
pleteness, lineage, and sources). The FGDC Standard requires only a coordinate 
pair defining a Minimum Bounding Rectangle (MBR) for the object, but allows 
more complex descriptions also. 

As can be seen in Figure 1, MBRs provide a compressed, abstract approxi- 
mation of a spatial object. The representation is conceptually powerful because 
it evokes a printed map. Its simplicity, computational efficiency, and storage ad- 
vantages make it the most commonly used spatial approximation [4]. Yet, the 
MBR has obvious weaknesses when representing diagonal, irregular, non-convex, 
or multi-part regions [18]. MBRs over-estimate area, misrepresent shape, and fail 
to capture the distribution of the data within themselves, leading to “false pos- 
itives” in GIR matching. 
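
The over-estimation of area can be made concrete with a small sketch. It is illustrative only: the diagonal strip below is a hypothetical region in arbitrary planar units (real coordinates would need a suitable projection), and the function names are our own rather than part of any system described here.

```python
def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def mbr(vertices):
    """Minimum bounding rectangle as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys), max(xs), max(ys))


def mbr_area(box):
    """Area of a rectangle given as (min_x, min_y, max_x, max_y)."""
    min_x, min_y, max_x, max_y = box
    return float((max_x - min_x) * (max_y - min_y))


# Hypothetical diagonal strip: a narrow region running from SW to NE.
strip = [(0, 0), (1, 0), (10, 9), (10, 10), (9, 10), (0, 1)]
box = mbr(strip)

# How many times larger the MBR is than the region it approximates.
inflation = mbr_area(box) / polygon_area(strip)
```

For this strip the MBR covers more than five times the area of the region itself, so most of the rectangle is space in which a query could overlap the MBR without touching the region.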

3 Geospatial Search and Ranking Methods 

Other spatial approximations, such as the minimum bounding ellipse, minimum 
bounding N-corner convex polygon, and convex hull, have been investigated in 
the context of spatial databases and GIS applications, but not for GIR, where the 
MBR still represents the state of the art. In searching, a query region representing 
the user’s area of interest may be defined by 1) Entering geographic coordinates 
for a point or bounding box, 2) Using a graphical map interface to zoom in to, 
click on, or draw a polygon, typically a bounding box, around the area of interest 
and 3) Entering a place name or selecting it from a list. 

Table 1. Methods for computing spatial similarity. 

Reference                      Formula 

Hill, 1990 [10] 
Walker et al., 1992 [19] 
Beard and Sharma, 1997 [3] 

Fig. 2. Spatial relationships between overlapping regions. 



The first two methods result in the delineation of a coordinate-based query 
region. The third uses a digital gazetteer to obtain coordinate representations for 
named places. Regardless of the method used, a query region is often represented 
internally as a simple bounding rectangle [11]. For geospatial searches, the query 
region is compared with MBRs of all candidate geographic information objects 
(GIOs) in the digital library using polygon-polygon geometric operations. If 
there is overlap between the query and the GIO regions, the GIO is considered 
a match. Possible relationships between two overlapping regions are illustrated 
in Figure 2. This is a simplified subset of the 9-intersection topological model 
for spatial relations [7]. Proximity relationships (such as near or adjacent) are 
not considered matches. 

GIR ranking methods are based on quantifying the similarity between the 
query and a GIO in the collection. This similarity “score” can be interpreted as 
an estimate of the relevance, or utility, of a candidate GIO for a user’s informa- 
tion need. Retrieved items are ranked and presented to the user in descending 
order of these scores. While traditional IR scores and rankings are based on the 
statistical properties of terms in a collection, GIR relies on spatial scores and 
rankings based on geospatial characteristics such as size, shape, location, and 
distance. There are three basic approaches to spatial similarity measures and 
ranking: 

Method 1: Simple Overlap. Candidate geographic information objects (or 
GIOs) that have any overlap with the query region are retrieved. 

Method 2: Topological Overlap. Spatial searches are constrained to only 
those candidate GIOs that: a) are completely contained within, b) overlap, 
or c) contain the query region. Each category is exclusive and all retrieved 
items are considered relevant. 

Method 3: Extent of Overlap. A spatial similarity score is derived from the 
extent of overlap between a candidate GIO and the query region. The greater 
the overlap, the greater the assumed relevance of the candidate GIO to 
the query. A variety of spatial scores based on overlap are discussed in the 
literature (Hill, 1990; Walker et al, 1992; Beard and Sharma, 1997) and 
presented in Table 1. 
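
The first two methods can be sketched for the common case where both the query and the candidate GIO are MBRs. This is a minimal illustration under our own assumptions: rectangles are tuples `(min_x, min_y, max_x, max_y)` in planar units, the function names are hypothetical, and regions that merely touch along an edge are treated as non-matching, in line with the exclusion of adjacency above.

```python
def overlap_area(a, b):
    """Area shared by two rectangles (min_x, min_y, max_x, max_y)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    if width <= 0 or height <= 0:
        return 0.0  # disjoint, or touching only along an edge or corner
    return float(width * height)


def contains(outer, inner):
    """True if `inner` lies completely inside (or coincides with) `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def classify(query, candidate):
    """Method 2: topological relation of a candidate GIO to the query."""
    if overlap_area(query, candidate) == 0.0:
        return "disjoint"   # not retrieved under Method 1 either
    if contains(query, candidate):
        return "within"     # candidate completely inside the query region
    if contains(candidate, query):
        return "contains"   # candidate completely contains the query region
    return "overlaps"       # partial overlap


query = (0, 0, 10, 10)
relations = [classify(query, c) for c in [
    (2, 2, 4, 4), (-5, -5, 20, 20), (8, 8, 12, 12), (11, 11, 12, 12)]]
```

Method 1 corresponds to retrieving every candidate whose relation is not "disjoint". Method 3 would replace the category label with a score computed from `overlap_area`. Identical rectangles are reported as "within" here, which is an arbitrary tie-breaking choice.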

The simple and topological overlap approaches are most commonly used in dig- 
ital libraries where the geographic objects of interest are represented by MBRs. 
Retrieval algorithms based on MBRs are easy to implement and are supported 
by the GEO profile of the Z39.50 information retrieval protocol [16]. However, 
the Boolean matching criterion does not allow for spatial ranking and thus in- 
hibits good retrieval performance [2] (p. 26), especially as result sets grow in 
size. Classifying retrieved candidates based on topological relationships (e.g., 
contains, overlaps, contained within), as in method 2, is a first step in discrimi- 
nating among the results, but it doesn’t speak directly to the issue of relevance. 
Moreover, the burden is on the user to understand these relationships and how 
they impact a geospatial search. There has been very limited research on the 
effectiveness of spatial ranking with Hill [10] presenting the only empirical data 
and evaluation. 



3.1 Probabilistic Spatial Ranking 

Maron and Kuhns [14] first introduced the idea that, given the imprecise and 
incomplete ways in which a user’s information need is represented by a query 
and an information object by its indexing, relevance should be approached prob- 
abilistically. This is especially true for geographic information retrieval since all 
geographic information objects are abstract, compressed representations of real 
world phenomena that contain some degree of error and uncertainty [9]. 

In the Logistic Regression (LR) model of IR [5], the estimated probability 
of relevance for a particular query and a particular record in the database, 
$P(R \mid Q, D)$, is calculated as the “log odds” of relevance, 
$\log O(R \mid Q, D)$, and converted from odds to a probability. The LR model 
provides estimates for a set of coefficients, $c_i$, associated with a set of 
$S$ statistics, $X_i$, derived from the query and database, such that: 

$$\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{S} c_i X_i \qquad (1)$$

where $c_0$ is the intercept term of the regression. The spatial ranking, or 
probability of relevance, can then be given as: 

$$P(R \mid Q, D) = \frac{e^{\log O(R \mid Q, D)}}{1 + e^{\log O(R \mid Q, D)}} \qquad (2)$$

Fig. 3. Search Query (dashed rectangle) and MBRs and Polygon Representations of 
Marin (NW) and Stanislaus (E) Counties. 
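
The conversion from log odds to a probability can be sketched as follows. This is a generic illustration of Equations 1 and 2, not the code used in the study. The function names and the example coefficients are hypothetical.

```python
import math


def log_odds(intercept, coefficients, statistics):
    """Equation 1: intercept plus the weighted sum of the statistics."""
    return intercept + sum(c * x for c, x in zip(coefficients, statistics))


def probability_of_relevance(z):
    """Equation 2: e^z / (1 + e^z), evaluated without overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical coefficients and statistics, chosen only to exercise the code.
z = log_odds(1.0, [2.0, 3.0], [0.5, 0.5])
p = probability_of_relevance(z)
```

Because Equation 2 is monotonic in the log odds, ranking by probability and ranking by log odds give the same order. The probability form is useful mainly when scores need to be interpreted or compared across models.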

For this study, the geospatial characteristics, i.e. explanatory statistics or 
feature variables, explored in the logistic regression model are: 

$X_1$ = area of overlap(query region, candidate GIO) / area of query region 
$X_2$ = area of overlap(query region, candidate GIO) / area of candidate GIO 
$X_3$ = 1 - abs(fraction of query region that is onshore - fraction of candidate 
GIO that is onshore) 

Like the spatial similarity measures presented in Table 1, $X_1$ and $X_2$ are 
based on the extent of the area of overlap and non-overlap between the query and 
candidate GIO regions. $X_3$ requires a bit more explanation. As noted in Hill 
[10], geographic areas that are near a coastline can be problematic when 
approximated by simplified geometries like the MBR. The MBR for an offshore region 
may necessarily include a lot of onshore area, and vice versa. We define $X_3$ as 
a “shorefactor” variable that captures the similarity between the fraction of a 
query region that is onshore and that of a candidate GIO region. For 
example, if a query region is 20% onshore and a candidate GIO region is 75% 
onshore, then the shorefactor is 1 - abs(.20 - .75) = .45. Calculating the 
shorefactor is illustrated in Figure 3. Marin County is 70% onshore, while 
Stanislaus County is 100% onshore. The dashed query box in Figure 3 is 45% 
onshore. Thus, the shorefactor for Marin is 1 - abs(.45 - .70) = .75, while for 
Stanislaus it is 1 - abs(.45 - 1) = .45. A shorefactor of 1 indicates that both 
regions have the same onshore fraction, for example when both are entirely 
onshore or both entirely offshore. A shorefactor approaching 0 indicates that 
one region is almost completely onshore and the other is almost completely 
offshore. The variable thus allows geographic context to be integrated into the 
spatial ranking process. The shorefactor was computed by intersecting both the 
query and GIO regions with a very generalized polygonal representation of the 
Western USA. 
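
The three feature variables can be sketched directly from their definitions. The function names are our own, and the onshore fractions are taken as given inputs. Computing them requires intersecting each region with a land polygon, which is outside the scope of this sketch.

```python
def overlap_features(overlap_area, query_area, candidate_area):
    """X1 and X2: overlap as a fraction of the query and of the candidate."""
    x1 = overlap_area / query_area
    x2 = overlap_area / candidate_area
    return x1, x2


def shorefactor(query_onshore_fraction, candidate_onshore_fraction):
    """X3: 1 minus the absolute difference between the two onshore fractions."""
    return 1.0 - abs(query_onshore_fraction - candidate_onshore_fraction)


# Worked examples from the text: the query box is 45% onshore.
marin = shorefactor(0.45, 0.70)        # Marin County, 70% onshore
stanislaus = shorefactor(0.45, 1.00)   # Stanislaus County, 100% onshore
```

The two values reproduce the figures quoted above (.75 for Marin and .45 for Stanislaus), up to floating-point rounding.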



4 Evaluation Approach 

We applied our logistic regression method and the three spatial ranking methods 
presented in Table 1 to a test collection of geospatial metadata. The research 
questions were: 1) How effectively can the geographic relationship between a 
query region and the region of a candidate information object be evaluated and 
ranked based on the overlap of the geographic approximations of these regions? 
and 2) How do different geographic approximations affect the rankings? To exam- 
ine question one, the results of the different ranking methods were summarized 
and compared using two standard retrieval performance evaluation measures: 
average precision at 11 standard recall levels and the mean average query preci- 
sion. 

In response to the second question, we applied and compared the results of 
these ranking methods for both MBR and convex hull approximations of the 
candidate GIOs. The convex hull is the minimum convex polygon that contains 
a geometric object (i.e. collection of points). It can be visualized as a rubber 
band around a geographic polygon to approximate its extent (see heavy lines 
in Figure 1). The convex hull is widely used as a geometric approximation in 
GIS and it provides the best approximation quality of conservative (i.e., encloses 
all points of the original) convex representations [4]. Because a convex hull is a 
better approximation of the original spatial object than an MBR, it will retrieve 
fewer false positives when used for GIR. 
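
A convex hull can be computed with Andrew's monotone chain algorithm, sketched below. This is a standard textbook method and not necessarily what the GIS software used in this study implements. The L-shaped region is a hypothetical example in planar units.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    n = len(vertices)
    total = sum(vertices[i][0] * vertices[(i + 1) % n][1]
                - vertices[(i + 1) % n][0] * vertices[i][1] for i in range(n))
    return abs(total) / 2.0


# Hypothetical non-convex (L-shaped) region.
l_shape = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]
hull = convex_hull(l_shape)

region_area = polygon_area(l_shape)   # the region itself
hull_area = polygon_area(hull)        # its convex hull
mbr_area = 4.0 * 4.0                  # its MBR spans (0, 0) to (4, 4)
```

For this region the hull area falls between the true area and the MBR area. That ordering always holds, because the hull contains the region and the MBR, being convex, contains the hull.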

We assumed that candidate GIO regions that overlap the query region are 
relevant and regions that do not overlap are not relevant. Given that all regions 
are represented by conservative approximations, all relevant items will be re- 
trieved (i.e. 100% recall). However, not all approximations that overlap will be 
relevant because the regions they represent may not overlap [18]. 

4.1 Test Collection Overview 

The test collection for this study was a subset of metadata records from the 
California Environmental Information Catalog (CEIC, http://ceres.ca.gov/catalog). 
The CEIC collection includes a wide variety of geographic information resources, 
including vector and raster geospatial data, maps, databases, documents, reports, 
websites, and models. These resources are documented with metadata prepared in 
accordance with the FGDC standard. For this study, approximately 2500 metadata 
records in XML format were selected from the total collection of about 4000 (as 
of August 2003). These records can be divided into two main categories: 1) those 
that refer to known, named geographic regions within the state (CA places); and 
2) user defined areas (UDAs), those regions that are specific to the person or 
organization that created the GIO described by the metadata. An important 
distinction between these two categories is that the geographic regions 
associated with the CA places are typical of those found in gazetteers and place 
name thesauri. Moreover, these regions can be traced, via their names, to 
geographic data containing more precise geographic representations, which we 
used in calculating the “shorefactor” described above. 
For the UDA regions, which seldom have accurate or complete data available, we 
assume that both the convex hull and complex polygon representations of the 
geographic extent are equal to the MBR approximation. The MBRs and convex 
hulls were pre-processed in the ESRI ArcView software and then loaded into 
Postgres 7.4, with the PostGIS 0.8 and GEOS 1.0 extensions, where the analysis 
was done. 



5 Results 

This research considers the test collection in two parts. In the first part, the issues 
of spatial representation and ranking are considered for the metadata indexed 
by CA places. The second part considers the entire collection, both CA places 
and UDAs. 

Our first set of tests considered only the 2072 test metadata records, or 
GIOs, that were indexed geospatially by known CA places. The 42 CA counties 
referenced in these GIOs were used, each in turn, as query regions. MBR and 
convex hull approximations of all CA places referenced by these metadata were 
treated as candidate GIO regions. 

The first task was to determine the reference set of candidate GIO regions 
relevant to each county query region. This was done using the complex polygon 
data to select all CA place regions that overlap, contain, or are contained within 
the query region. All retrieved regions were reviewed (semi-automatically) to 
remove sliver matches, i.e. those regions that only overlap due to inconsistencies 
in the data. This process resulted in a master file of CA place regions relevant 
to the 42 CA county query regions. Queries for ten county regions were used 
to train the logistic regression models. LR Equation 3 was used for the MBR 
rankings and LR Equation 4 for convex hulls: 

$$\log O(R \mid Q, D) = -5.040 + 6.5154 \cdot X_1 + 5.7729 \cdot X_2 \qquad (3)$$

$$\log O(R \mid Q, D) = -3.4767 + 7.4536 \cdot X_1 + 5.7569 \cdot X_2 \qquad (4)$$

Queries for the other 32 county query regions were run against the MBR and 
convex hull approximations of the candidate GIO regions. We then applied all 
four spatial ranking methods to the result sets and calculated precision-recall 
summary statistics. 
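
As a rough illustration of how the fitted models could be applied, the sketch below plugs the coefficients of Equations 3 and 4 into the logistic form and ranks a few candidates. It assumes both equations have the form c0 + c1·X1 + c2·X2. The candidate names and their overlap fractions are invented for the example, and the function names are our own.

```python
import math

# (intercept, coefficient of X1, coefficient of X2)
MBR_MODEL = (-5.040, 6.5154, 5.7729)     # Equation 3
HULL_MODEL = (-3.4767, 7.4536, 5.7569)   # Equation 4


def relevance_probability(model, x1, x2):
    """Probability of relevance from the overlap fractions X1 and X2."""
    c0, c1, c2 = model
    z = c0 + c1 * x1 + c2 * x2
    return 1.0 / (1.0 + math.exp(-z))


def rank(model, candidates):
    """Sort (name, x1, x2) candidates by descending probability of relevance."""
    scored = [(relevance_probability(model, x1, x2), name)
              for name, x1, x2 in candidates]
    return sorted(scored, reverse=True)


# Hypothetical candidates: (name, X1, X2).
candidates = [("A", 0.9, 0.8), ("B", 0.1, 0.05), ("C", 0.5, 0.5)]
ranking = [name for _, name in rank(MBR_MODEL, candidates)]
```

With these coefficients a candidate that coincides exactly with the query (X1 = X2 = 1) scores above 0.99, while one with negligible overlap scores below 0.01.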

Tables 2 and 3 show the evaluation results of the four spatial ranking methods 
on the CA places subset of the test collection. These tables show that: 1) the 
values for the non-logistic regression ranking methods are extremely similar; 2) 
the logistic regression method performed better than the other methods on this 
test collection; 3) for all methods, rankings based on the convex hull 
representations performed better than those based on the MBR representations. 
Yet, it is interesting to note that the non-logistic regression spatial ranking 
methods applied to the convex hull approximations do not perform better than the 
logistic regression method applied to the MBRs. An important implication of this 
is that it is worth investigating more effective spatial ranking methods before 
adopting more complex spatial approximations. 

Table 2. Mean Average Query Precision for Named Places. 

Ranking method             MBRs      Convex Hulls 
Hill, 1990                 0.7193    0.8097 
Walker et al., 1992        0.7025    0.8006 
Beard and Sharma, 1997     0.7094    0.8116 
Logistic Regression        0.9389    0.9973 

Table 3. Average Precision at 11 Standard Recall Levels for Named Places using 
Minimum Bounding Rectangles and Convex Hulls. 

               Minimum Bounding Rectangles          Convex Hulls 
Recall Level   Hill    Walker  Beard   Logistic     Hill    Walker  Beard   Logistic 
0.00           1.0000  1.0000  1.0000  1.0000       1.0000  1.0000  1.0000  1.0000 
0.10           0.8668  0.8660  0.8717  0.9777       0.9232  0.9248  0.9277  1.0000 
0.20           0.8409  0.8362  0.8430  0.9663       0.9152  0.9049  0.9083  1.0000 
0.30           0.8101  0.8109  0.8214  0.9651       0.8708  0.8775  0.8813  1.0000 
0.40           0.8017  0.7985  0.8073  0.9651       0.8705  0.8669  0.8746  1.0000 
0.50           0.7940  0.7972  0.8068  0.9651       0.8661  0.8658  0.8735  1.0000 
0.60           0.7919  0.7951  0.8039  0.9651       0.8660  0.8658  0.8735  1.0000 
0.70           0.7919  0.7951  0.8039  0.9643       0.8623  0.8658  0.8698  0.9997 
0.80           0.7919  0.7951  0.8039  0.9520       0.8623  0.8658  0.8698  0.9983 
0.90           0.7914  0.7947  0.8035  0.9477       0.8621  0.8656  0.8696  0.9882 
1.00           0.7613  0.7684  0.7881  0.9114       0.8243  0.8274  0.8291  0.9648 
Avg Prec       0.8220  0.8234  0.8321  0.9618       0.8839  0.8846  0.8888  0.9955 

Average precision at 11 standard recall levels (Table 3) gives one an idea of 
how an algorithm performs over the course of retrieving all relevant GIOs. Mean 
average query precision is a measure that favors systems that rank relevant 
documents early in the results (Table 2). It averages precision values after 
each new relevant document is observed in the ranked list and presents a summary 
statistic of overall performance. However, it may not indicate if an algorithm 
has poor recall [2] (p. 80). Nevertheless, these characteristics make it a good 
fit for spatial ranking algorithms because poor recall is very rarely an issue 
and high precision is desirable. Moreover, the metric is insensitive to 
differences in the number of items indexed to the same geographic region. The 
same is not true of average precision at standard recall levels, which is why 
the values for average precision (Table 3) differ from those for mean average 
query precision (Table 2). Interestingly, the two measures differ less for the 
logistic regression method than for the non-logistic regression ranking methods. 
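
The two evaluation measures can be sketched using their usual textbook definitions. The exact computation used in the study is not spelled out here, so this is an assumption. A ranked result list is represented as a list of booleans (True for a relevant item), and the function names are our own.

```python
def average_precision(ranked_relevance, total_relevant):
    """Mean of the precision values observed at each relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0


def eleven_point_precision(ranked_relevance, total_relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    points = []  # (recall, precision) observed at each relevant item
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    result = []
    for i in range(11):
        level = i / 10
        # Interpolation: best precision at any recall >= this level.
        reachable = [p for r, p in points if r >= level - 1e-12]
        result.append(max(reachable) if reachable else 0.0)
    return result


def mean_average_precision(per_query_results):
    """Mean of average precision over (ranked_relevance, total_relevant) pairs."""
    scores = [average_precision(r, n) for r, n in per_query_results]
    return sum(scores) / len(scores)


# Hypothetical ranked list: relevant items at ranks 1 and 3, two relevant in total.
ranked = [True, False, True, False, False]
ap = average_precision(ranked, 2)
curve = eleven_point_precision(ranked, 2)
```

In this example the average precision is (1/1 + 2/3) / 2, and the interpolated curve stays at 1.0 up to recall 0.5 before dropping to 2/3, which shows how a single early false positive affects the two summaries differently.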

The second part of our study considers the test collection metadata as a 
whole: both those metadata indexed by CA places and those indexed by user-defined 
areas (UDAs). As with the tests in Part I, the 42 CA counties referenced 
in the GIOs were considered query regions and the MBR and convex hull rep- 
resentations for all geospatially indexed areas were treated as candidate GIO 
regions. 

Ray R. Larson and Patricia Frontiera 



Table 4. Mean Average Query Precision for Full Collection. 

Ranking Method             MBRs      Convex Hulls 
Hill, 1990                 0.6722    0.7936 
Walker et al., 1992        0.6509    0.7810 
Beard and Sharma, 1997     0.6523    0.7778 
Logistic Regression 1      0.8141    0.9099 
Logistic Regression 2      0.8819    0.9238 

The reference set of UDA regions relevant to each county query region was 
determined through a manual review of the UDA metadata. This process could 
not be automated because, unlike the CA place regions, there are no reference 
data sets of complex polygons that delineate the UDA regions. 

As in Part I, queries for ten county query regions were used to train the 
logistic regression models. Because 88% of the UDAs represent coastal or offshore 
regions, an additional logistic regression model was tested that includes the 
shorefactor variable. LR Equations 5 and 6 were used for MBRs and Equations 
7 and 8 were used for convex hulls: 

$$\log O(R \mid Q, D) = -1.6747 + 1.9871 \cdot X_1 + 3.2976 \cdot X_2 \qquad (5)$$

$$\log O(R \mid Q, D) = -2.1303 + 1.9138 \cdot X_1 + 3.2157 \cdot X_2 + 0.7451 \cdot X_3 \qquad (6)$$

$$\log O(R \mid Q, D) = -1.2123 + 1.4471 \cdot X_1 + 5.4585 \cdot X_2 \qquad (7)$$

$$\log O(R \mid Q, D) = -1.2825 + 1.4341 \cdot X_1 + 5.4096 \cdot X_2 + 0.1267 \cdot X_3 \qquad (8)$$

The evaluation results are presented in Tables 4 and 5. These show a similar pat- 
tern to the results presented in Part I. The logistic regression rankings perform 
better than non-logistic regression methods and the convex hull approximations 
also perform better than the MBRs. Again, the logistic regression rankings for 
MBRs perform as well as or better than the non-logistic regression rankings for 
convex hulls, although by a smaller margin than when just the CA place regions 
were considered. 

The addition of the UDA regions significantly degrades the retrieval 
performance for all algorithms, even though these regions only index 19% of the 
total metadata records. The majority of the UDA regions are for coastal or 
near-coastal offshore areas which, when approximated by either MBRs or convex 
hulls, necessarily overlap with onshore regions, thus generating more 
false-positive retrievals. The logistic regression model (LR2) that incorporates 
the shorefactor variable is meant to address this problem, yet this method shows 
only a small (but significant) improvement over the other logistic regression 
model (LR1), especially for the convex hull approximations. T-tests for paired 
samples of LR1 and LR2 results gave t values ranging from -3.028 to -4.144, with 
a probability of random occurrence of 0.005 or less. 
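
For readers unfamiliar with the paired-samples t-test, the statistic can be sketched as below. The per-query precision values are invented for illustration and are not the study's data. The function name is our own.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """t statistic for paired samples: mean difference over its standard error."""
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    standard_error = stdev(differences) / math.sqrt(n)
    return mean(differences) / standard_error


# Hypothetical per-query average precision for two models on the same queries.
lr1 = [0.80, 0.82, 0.78, 0.85]
lr2 = [0.84, 0.85, 0.80, 0.90]
t = paired_t_statistic(lr1, lr2)
```

The statistic is negative when the first model scores lower than the second on average, which matches the sign of the values reported above if LR1 is taken as the first sample. The corresponding probability comes from a t distribution with n - 1 degrees of freedom, which this sketch does not compute.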

Table 5. Average Precision at 11 Standard Recall Levels for the Full Collection using 
Minimum Bounding Rectangles and Convex Hulls. 

               Minimum Bounding Rect.                   Convex Hulls 
Recall Level   Hill    Walker  Beard   LR1     LR2      Hill    Walker  Beard   LR1     LR2 
0.00           1.0000  1.0000  1.0000  1.0000  1.0000   1.0000  1.0000  1.0000  1.0000  1.0000 
0.10           0.8384  0.8453  0.8562  0.9174  0.9529   0.9099  0.9146  0.9188  0.9715  0.9781 
0.20           0.7983  0.7782  0.7885  0.9114  0.9413   0.8905  0.8827  0.8874  0.9676  0.9746 
0.30           0.7575  0.7729  0.7871  --      0.9395   0.8484  0.8634  0.8686  0.9560  0.9641 
0.40           0.7402  0.7460  0.7570  0.8785  0.9310   0.8428  0.8482  0.8585  0.9534  0.9635 
0.50           0.7377  0.7450  0.7570  0.8767  0.9291   0.8406  0.8481  0.8583  0.9534  0.9625 
0.60           0.7350  0.7420  0.7538  0.8742  0.9291   0.8406  0.8481  0.8583  0.9534  0.9625 
0.70           0.7350  0.7420  0.7538  0.8742  0.9291   0.8403  0.8481  0.8548  0.9505  0.9579 
0.80           0.7350  0.7420  0.7538  0.8631  0.9182   0.8371  0.8481  0.8548  0.9412  0.9539 
0.90           0.7344  0.7416  0.7534  0.8631  0.9018   0.8312  0.8478  0.8544  0.9342  0.9432 
1.00           0.7067  0.7139  0.7311  0.7715  0.7743   0.7819  0.7782  0.7787  0.8340  0.8272 
Avg Prec       0.7744  0.7790  0.7902  0.8836  0.9224   0.8603  0.8661  0.8721  0.9468  0.9534 


6 Conclusions 

In GIS and spatial database technologies, geometric approximations, primarily 
the MBR, are used as a first step to filter possible matches. Then, a refinement 
step examines the actual complex spatial objects to determine the final result 
set. However, in a geographic digital library environment, the end-user is the 
refinement step. For this reason, both high-quality approximations that limit 
the number of false matches and spatial ranking strategies that present best 
matches first are extremely important in GIR. We have shown that a logistic 
regression based spatial ranking algorithm can provide significant improvements 
for geographic information retrieval, even when the simplest regional approx- 
imations (MBRs) are used. We have also shown that taking into account the 
portion of offshore areas included in a geographic representation can improve 
GIR performance even further. 



Abstract. This paper presents a holistic evaluation of an operational information 
system that employs the Boolean search technique. An equal focus is laid 
on both the system (system perspective) and its users (user perspective) in the 
actual environment where the system and its users are functioning (contextual- 
ity). In addition to these research objectives, the study has a methodological ob- 
jective to test an evaluation approach developed by Borlund [1] in a real life 
setting. Our evaluation methodology involves triangulation (pre-search ques- 
tionnaires; search log; post-interviewing) as well as novel interactive perform- 
ance measures, such as the Ranked Half-Life measure and the Satisfaction and 
Novelty perception by users supplementing the traditional Precision. The study 
confirms the finding of earlier research and reveals the discrepancy between the 
evaluation results according to the system and the user perspectives. More spe- 
cifically, the system performed better when evaluated from the user perspective 
than from the system perspective. 



1 Introduction 

The evaluation of information systems has traditionally been conducted in laboratory 
settings, and focused on the objective performance of information systems. Classical 
information retrieval research aims at developing more advanced and exact algo- 
rithms for retrieval purposes. However, it does not focus on the evaluation of the 
systems from a user perspective. This has been compensated during recent years by 
more user-oriented information retrieval studies, but these suffer unfortunately from 
the lack of standardised tools for analysis. Findings are thus seldom possible to gener- 
alise and difficult to compare. In the light of these difficulties, the evaluation of in- 
formation systems outside laboratory settings is therefore a necessary challenge to 
meet. 

This paper presents an experimental study that aims to conduct a holistic evalua- 
tion of an operational information system. By holistic evaluation we mean that an 
equal focus is laid on both the system (system perspective) as well as its users (user 
perspective) in the actual environment where the system and its users are functioning 
(contextuality). We report the research findings on the use, the efficiency and the 
effectiveness of the information system. First, an information system for newspaper 
articles was tested by comparing algorithmic relevance judgements [2, 3] with situational 
relevance judgements [4, 3] by a journalist performing a journalistic task. 
Second, the system was evaluated as a resource among other available information 
sources. In addition to these research objectives the study had a methodological objec- 
tive to test a new research approach, which was originally developed in an experimen- 
tal setting [1], in a real life environment. The paper begins with a presentation of the 
theoretical framework and related research findings. The research methods and data 
are then described. This is followed by an analysis of the results. The paper concludes 
with a discussion of the findings. 



2 Information Retrieval for Work Task Completion 

In the present study we aim to evaluate an operational information retrieval system 
from a holistic perspective. The focus is on three dimensions. First, we consider the 
effectiveness of the system in responding to the search query (system perspective). 
Second, we consider the usefulness of the search results for the information need 
(user perspective). Third, we also consider the role of the system among other avail- 
able information resources (contextuality). 

This paper addresses the following research questions: What role does the IR sys- 
tem play for its users, as a resource among other available information sources? How 
well do the relevance assessments of the IR system (algorithmic relevance) corre- 
spond with those of the users (situational relevance)? And: Is the evaluation method 
applied well suited for evaluations of operational IR systems in real-life settings? 

This research setting uses the cognitive model of the IR interaction by Ingwersen 
[5, 6] as a starting point. He suggests that research on information retrieval needs to 
acknowledge the context (a work task or an interest as well as the social environment). 
In addition to the “cognitive context” (or as a specific result of it [7]), the actual 
resources available are likely to affect the perceived usefulness of an information 
system. We argue that real-life information retrieval is an integrated part of informa- 
tion seeking in general and that both of these processes are seldom separable from the 
overall situation where the information is sought and used [8]. In the present study, 
we consider information retrieval in the context of work task performance conducted 
among other information seeking activities in order to accomplish the task at hand. 

In order to combine a system perspective and a user perspective, we have chosen to 
base the evaluation on algorithmic relevance and situational relevance [2, 3]. Algo- 
rithmic relevance is an objective relevance criterion that is determined through the 
match between a query and the document (used here in a broad meaning: carrier of 
information) representation. These relevance judgements are stable in the sense that 
as long as the query does not vary, neither do the relevance values of the documents. 
This is a common method for ranking search results in information systems. The 
situational relevance criteria depend on the situation where information is used [3, 9, 
10, 11, 19]. This means that the same document, even as a result of the same query, 
may have different relevance values depending on the user and the task process. Relevance 
is determined according to the task performer’s perception of the usefulness of 
the document for his/her work task. As algorithmic relevance focuses on the matching 
of the query and the document (mathematical relevance measurement), situational 
relevance focuses on the match between the perceived information need and the per- 
ceived value of the document content (intellectual relevance measurement). 

To increase the holistic nature of the present evaluation, the system was evaluated 
as a resource among other available sources. To our knowledge, this is an evaluation 
aspect that is neglected in IR studies, where information systems are evaluated as the 
only source of information. Since operational information systems are functioning in 
environments where there are several different kinds of information sources available, 
it is important to recognise that the usefulness of an information system is determined 
in relation to the other available information sources in real-life environments. This is 
an important consideration, since research has shown that people look for different 
types of sources [12]. Similarly, people match communication channels with the type 
of communication matter [13]. Therefore, it is likely that the task performer is not 
asking the information system to cover all of the information needed for the work task 
completion but certain parts of it [8]. 



3 Research Method and Data 

The research method used in this study is inspired by Borlund's [1] approach to the 
evaluation of (interactive) information (retrieval) systems, the IIR evaluation package, 
which was created as an alternative to the traditional, system-oriented approach. The 
main advantages of her approach are that it combines the system perspective with a 
user perspective, and that it ensures both realism and control. It also proposes alterna- 
tive performance measures, which allow non-binary relevance assessments. 

The IIR evaluation package contains three components that focus on (1) an appropriate 
research setting, (2) empirical recommendations for simulation and (3) alternative 
relevance measures. In order to fulfil the conditions for an appropriate research 
setting, the participants need to be potential users with both individual, (potentially) 
dynamic information needs and relevance judgements (based on an authentic or simu- 
lated situation). Borlund [1] provides several empirical recommendations. She sug- 
gests that both an information need from a simulated situation and from an authentic 
situation ought to be studied. Furthermore, the simulated situation needs to be tailored 
for the test persons (easily identifiable, interesting, relevant and sufficiently informa- 
tive). The order of the simulated and authentic situation ought to be varied in order to 
avoid (possible) learning effects. Finally, she recommends a pilot test. For alternative 
relevance measures, the measure of RR (Relative Relevance) and the RHL (Ranked 
Half-Life) are proposed. 



3.1 Setting 

The participants of the study were newspaper journalists who at the time were employed 
at the second largest newspaper in Scandinavia, Göteborgs-Posten (GP). 
Twenty out of a total of 280 journalists were selected. They represent different gen- 
ders, ages and several different editorial departments. The information system, 
NewsLink, is a manually indexed full-text database containing all articles published 
in GP since 1994 and a selection of articles from 1992-93. The system employs the 
Boolean search technique and offers a choice between showing the retrieved result 
ranked by date or relevance. 

3.2 Data Collection 

The setting of our study corresponds to Borlund’s approach by involving real users as 
test persons and by using both simulated and real work task situations as a trigger for 
information search. A simulated work task situation is a short “cover story” to frame 
the information search, which is common for all participants. This ensures experimen- 
tal control across the tested system and across the participants in the study. The set- 
ting allowed the collection of both traditional system-oriented data on system per- 
formance and user-oriented data on the behaviour and perceptions of the participants. 
Borlund [1] used four simulated work task situations and one personal situation per 
participant in her study. Since our participants were real users with pressing time 
constraints, we decided to settle for one of each. We used the following simulated 
work task situation: 

The Terror Attacks in USA September 11th 

Some time has passed since the terror attacks in USA. GP is now planning a 
follow-up and a summarizing series of articles on the subject, which will il- 
lustrate how different areas have been affected and the long term conse- 
quences. You have been asked to write an article that will illustrate this issue 
from your particular subject field. 

The data collection methods used were: (1) questionnaires on the demographics and 
searching experience of the participants in the study; (2) search protocols (combined 
with observation) designed for the purpose of collecting information on the partici- 
pants’ original and modified query formulations, their non-binary relevance assess- 
ments (based either on title or full-text) as well as the algorithmic rank order, by date 
and relevance respectively, and information about whether or not a document was 
known to the user; (3) a post-search interview considering the practical use of the 
system, and its role for the users as a resource among other available information 
sources. It was also a way to gain additional information about the participants’ rele- 
vance assessments and level of satisfaction. During the course of our study, we found 
a discrepancy between the editorial departments in the usage of NewsLink. Because 
of this, we gathered complementary information by means of a web form directed to 
all journalists at GP. The information we requested focused on their main information 
sources as well as how often and for what purposes they use NewsLink. The 
questionnaires, the interview and the web form all included questions on the users’ 
information behaviour in a broader sense, which were not included in Borlund’s study. 
This aspect was added to enable a holistic evaluation of the system and its users. 

3.3 Analysis 

The performance measures used in the analysis were RHL (Ranked Half-life), Preci- 
sion, Satisfaction and Novelty. As opposed to Borlund (2000), we did not use the RR 
measure, partly due to technical obstacles and partly to reduce the demands on the 
participants’ time and effort. As a consequence of this no expert panel was used. We 
concluded that the RHL measure provides sufficient data. The Satisfaction [14] and 
Novelty [15] measures were added, as we anticipated these to be noteworthy factors 
affecting the users’ relevance judgements. 

Precision was calculated at DCV (Document Cut off Value) 15, which is common 
in IR evaluations. It measures the number of relevant articles the system retrieves and 
places among the top 15 documents. Precision was calculated on the respondents’ 
situational relevance assessments made on a non-binary scale. Precision may there- 
fore be more correctly called perceived precision. 
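The calculation of perceived precision can be illustrated with a minimal Python sketch. It assumes that the graded assessments (0, 0.5 or 1) of the top 15 documents are summed and divided by the cut-off value; the function name and the choice to divide by the DCV rather than by the number of retrieved documents are assumptions made for illustration. The input values are those of the ranking by date in Table 1.

```python
# Graded assessments of the top 15 documents (Table 1, ranking by date):
# 0 = not relevant, 0.5 = partly relevant, 1 = very relevant.
BY_DATE = [1, 1, 1, 0.5, 1, 1, 1, 0.5, 0, 1, 1, 1, 0, 0, 0]


def perceived_precision(assessments, dcv=15):
    """Sum of the graded relevance values in the top `dcv` ranks, divided by `dcv`.

    Dividing by the cut-off value (not by the number of documents actually
    retrieved) is an assumption of this sketch.
    """
    return sum(assessments[:dcv]) / dcv


print(round(perceived_precision(BY_DATE), 2))  # 0.67
```

With this reading, a search that returns a single, fully relevant document would obtain a Precision of 1/15, which is in line with the low values reported for such searches.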

RHL was used as a complement to (perceived) precision. It shows the ability of the 
system to place relevant documents high in the ranked list of retrieved documents, 
that is, its capability of ranking its output according to the end users’ situational rele- 
vance assessments. The RHL measure is based on the common formula for the me- 
dian of grouped continuous data [1, 16]. The RHL value is the median case of the 
continuous data. Each document in the algorithmically ranked list represents a class of 
grouped data where the frequency equals the assigned relevance value. The lower the 
RHL value, the higher the relevant documents are placed in the ranked output, i.e. the 
better the retrieval engine [1]. 

$$\mathrm{RHL} = L_m + \left( \frac{n/2 - \sum f}{F(\mathrm{med})} \right) \times CI \qquad (1)$$

L_m:     Lower real limit of the median class, i.e. the rank position of the lowest 
         positioned information objects above the median class 
n:       Number of observations, i.e. the total sum of the assigned relevance values 
\sum f:  Cumulative frequency (relevance values) up to and including the class 
         preceding the median class 
F(med):  The frequency (relevance value) of the median class 
CI:      Class interval, commonly in IR = 1 (the highest relevance value (upper real 
         limit) 1 minus the lowest (lower real limit) 0 = 1) 

In order to make the RHL value more comparable, it can be recalculated into an RHL 
index value. This is done by normalising it against a predefined Precision value 
(Precision = 1). This Precision value is divided by the calculated Precision value. The 
quotient is then multiplied by the calculated RHL value, thus resulting in the RHL 
index value [1]. Below is the formula for calculating the RHL index, where the predefined 
Precision value is equal to 1. 

$$\mathrm{RHL\ index} = \frac{1}{P} \times R \qquad (2)$$

P:  Calculated Precision value 
R:  Calculated RHL value 

Table 1 shows an example from one of the searches made by a participant in our 
study. Below the table are calculations on the RHL and RHL index, to facilitate the 
understanding of them. The values in the columns for Ranking by Date and Relevance 

Table 1. Values for relevance assessments made in ranking by date and algorithmic relevance 

Ranking by   User’s       (Cumulative)   Ranking by   User’s       (Cumulative) 
Date         assessment                  Relevance    assessment 
1            1            (1)            1            0            (0) 
2            1            (2)            2            1            (1) 
3            1            (3)            3            1            (2) 
4            0.5          (3.5)          4            1            (3) 
5            1            (4.5)          5            0            (3) 
6            1            (5.5)          6            0            (3) 
7            1                           7            1            (4) 
8            0.5                         8            0            (4) 
9            0                           9            1            (5) 
10           1                           10           1 
11           1                           11           1 
12           1                           12           0 
13           0                           13           1 
14           0                           14           1 
15           0                           15           1 
Sum          10           10/2 = 5       Sum          10           10/2 = 5 
Precision    0.67                        Precision    0.67 
RHL          5.5                         RHL          9 
RHL index    8.21                        RHL index    13.43 





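represent the non-binary relevance assessments made by the respondent; not relevant 
= 0, partly relevant = 0.5, very relevant = 1. The figures used in the calculations are 
stressed in bold. 

Ranking by date: 

$$\mathrm{RHL} = 5 + \left( \frac{5 - 4.5}{1} \right) \times 1 = 5.5 \qquad \mathrm{RHL\ index} = \frac{1}{0.67} \times 5.5 = 8.21$$

Ranking by algorithmic relevance: 

$$\mathrm{RHL} = 8 + \left( \frac{5 - 4}{1} \right) \times 1 = 9 \qquad \mathrm{RHL\ index} = \frac{1}{0.67} \times 9 = 13.43 \qquad (3)$$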



This example shows how RHL can distinguish between the results even if the Preci- 
sion values are exactly the same. In the ranking made by date, half of the Precision 
value is obtained by going through just over five documents, while nine documents 
are scrutinised in ranking by algorithmic relevance. The RHL index value makes the 
results comparable by normalising the RHL values to Precision = 1. 
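The worked example can be reproduced with a short Python sketch of the grouped-median calculation. This is an illustration under stated assumptions, not the procedure used in the study: the function names are hypothetical, each rank is treated as one class with a class interval of 1, and the RHL index is computed with the rounded Precision value 0.67 that appears in Table 1.

```python
# Graded assessments from Table 1 (0 = not relevant, 0.5 = partly, 1 = very relevant).
BY_DATE = [1, 1, 1, 0.5, 1, 1, 1, 0.5, 0, 1, 1, 1, 0, 0, 0]
BY_RELEVANCE = [0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1]


def ranked_half_life(assessments, class_interval=1.0):
    """Median of grouped data: each rank is a class whose frequency is its relevance value."""
    n = sum(assessments)  # number of observations
    if n <= 0:
        raise ValueError("RHL is undefined when no document is judged relevant")
    half = n / 2
    cumulative = 0.0  # cumulative frequency of the classes preceding the current rank
    for rank, value in enumerate(assessments, start=1):
        if value > 0 and cumulative + value >= half:
            lower_limit = rank - 1  # lower real limit of the median class
            return lower_limit + (half - cumulative) / value * class_interval
        cumulative += value
    raise RuntimeError("median class not found")  # not reached for valid input


def rhl_index(rhl, precision):
    """Normalise an RHL value against Precision = 1."""
    return (1.0 / precision) * rhl


print(ranked_half_life(BY_DATE))         # 5.5
print(ranked_half_life(BY_RELEVANCE))    # 9.0
print(round(rhl_index(5.5, 0.67), 2))    # 8.21
print(round(rhl_index(9.0, 0.67), 2))    # 13.43
```

Using the unrounded Precision value 10/15 instead of 0.67 would give slightly different index values (8.25 and 13.5), so the rounding step matters when results are compared across studies.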

Novelty [15] was calculated and used in combination with the interview questions 
on the same topic, to give an idea of whether the situational relevance assessments 
were affected by the document being previously known by the user. Satisfaction [14] 
is a measure of the level of satisfaction the respondents feel after completing their 
search task. 



4 Results 

The results focus on the comparability of the relevance judgements assessed by the 
system and by the users. The aim was to determine how well the system serves the 



Evaluation of an Information System in an Information Seeking Process 63 

end users attending their work tasks. The results also focus on the estimated value of 
the system as an information source among other available sources, both in general 
and in relation to the task at hand. 

We placed our respondents in categories based on editorial departments, gender 
and age to facilitate comparisons concerning the usage of NewsLink. As we found no 
discrepancies in the gender and age categories, we present the editorial categories: Feature 
group - editorial departments writing longer reports (six respondents); Specialists’ 
group - editorial departments with a more specific direction (five respondents); and 
Local news group - editorial departments writing about domestic and local matters 
(nine respondents). Of the total of 280 journalists, 63 filled in the web form. Their 
answers correspond with our findings from the rest of the study. 

4.1 The Role of the System Among Other Information Sources Available 

The overall most important information sources are oral sources and the Internet. 
NewsLink and other media are also among the top sources. Oral sources and the 
Internet are generally used daily, while NewsLink is used on a weekly basis. There 
are some differences between the editorial categories. The Local news group uses 
NewsLink most of all, stating that they use the system either daily or weekly. None of 
the respondents in either the Feature group or the Specialists’ group use NewsLink 
daily. 

Our findings show that NewsLink is an important information source but is often 
used in combination with other sources. NewsLink is mainly used for checking what 
has been written in GP, for finding specific information and avoiding duplicate arti- 
cles. 

The majority of the participants considered themselves as knowledgeable about the 
system. When it comes to search functions, they all stated they have a good grasp of 
free text search, while more than half of them stated that they need more education in 
truncation and Boolean search logic. At the same time, our data revealed a need for 
end user education. For example, the respondents asked for functions that already 
existed but that they were not aware of. 



4.2 The Relevance Assessments by the System and Its Users 

The results for both the Precision and RHL index show that the system performs best 
on the real work tasks, as seen in tables 2 and 3. 



Table 2. Measures of central tendency and measures of variation for Precision at DCV 15 

Precision DCV 15      Simulated work task      Real work task 
                      Date      Relevance      Date      Relevance 
Mean                  0.25      0.27           0.3       0.38 
Standard deviation    0.156     0.146          0.123     0.162 
Median                0.21      0.28           0.33      0.38 



Table 2 shows the average Precision values at DCV 15. The median and the mean do 
not differ much. As all mean values for Precision are below 0.5 (the optimal being 
1.0), the system is not especially efficient. The standard deviation shows that the 
values are relatively coherent. The results were somewhat better for the respondents’ 
real work task, with values generally between 0.18-0.42 as ranked by date and 0.22- 
0.54 as ranked by algorithmic relevance. For the simulated work task the Precision 
values for ranking by date were normally between 0.09-0.41, and ranking by algorithmic 
relevance had values between 0.12-0.42. Only for the real work task, as ranked by 
relevance, did we find values over 0.42. 



Table 3. Measures of central tendency and measures of variation for RHL index at DCV 15 

RHL index DCV 15      Simulated work task      Real work task 
                      Date      Relevance      Date      Relevance 
Mean                  26.14     19.29          15.29     15.59 
Standard deviation    15.951    12.761         9.360     9.306 
Median                25.17     18.27          13.3      11.6 



Table 3 shows some differences between mean and median, especially for the real 
work task (as ranked by relevance). This is also visible in the relatively widespread 
values. For the simulated work task, it was necessary to go to places 10.19 to 42.09 to 
find enough relevant documents as they were ranked by date, while when ranking by 
algorithmic relevance they were found at places 6.53 to 32.05. The corresponding 
values for the real work task were 5.93-24.65 as ranked by date and 6.28-24.9 as 
ranked by algorithmic relevance. 
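The value ranges quoted in the text for Tables 2 and 3 coincide with the interval of one standard deviation around the mean. The following Python sketch recomputes them from the tabulated means and standard deviations; the dictionary layout and the function name are assumptions made for illustration, and the interpretation of the ranges as mean ± 1 SD is inferred from the arithmetic rather than stated in the text.

```python
# (mean, standard deviation) per task type and ranking, copied from Tables 2 and 3.
PRECISION = {
    ("simulated", "date"): (0.25, 0.156),
    ("simulated", "relevance"): (0.27, 0.146),
    ("real", "date"): (0.3, 0.123),
    ("real", "relevance"): (0.38, 0.162),
}
RHL_INDEX = {
    ("simulated", "date"): (26.14, 15.951),
    ("simulated", "relevance"): (19.29, 12.761),
    ("real", "date"): (15.29, 9.360),
    ("real", "relevance"): (15.59, 9.306),
}


def one_sd_band(mean, sd):
    """Interval of one standard deviation around the mean, rounded to two decimals."""
    return (round(mean - sd, 2), round(mean + sd, 2))


for name, table in (("Precision", PRECISION), ("RHL index", RHL_INDEX)):
    for (task, ranking), (mean, sd) in table.items():
        low, high = one_sd_band(mean, sd)
        print(f"{name:9} {task:9} {ranking:9} {low} to {high}")
```

For example, the real work task ranked by relevance gives 0.22 to 0.54 for Precision, and the simulated work task ranked by date gives 10.19 to 42.09 for the RHL index, matching the figures in the text.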

The results from the Precision and RHL index show that the system generally performs 
better in relation to real work tasks than for simulated work tasks. One explanation 
might be that the participants are familiar with the requirements of the work task and 
more motivated to obtain useful results [8]. As for the simulated work task, the breadth 
of the topic made it applicable to all the different editorial departments, but it may 
also have made relevance assessments more perfunctory. The topic was somewhat 
difficult to place in time, which may have favoured the ranking by relevance. 



Table 4. Value of Satisfaction for the editorial categories respectively, with reference both to 
the real and the simulated information need 

           Feature group    Specialists’ group    Domestic group    Total 
           Real    Sim.     Real    Sim.          Real    Sim.      Real    Sim. 
Yes        6       4        4       1             6       2         16      7 
In part    0       0        0       2             0       2         0       4 
No         0       2        1       2             3       5         4       9 
Total      6       6        5       5             9       9         20      20 



Our results for Novelty and the post-search interview show that the relevance assess- 
ments are usually affected by whether or not a document is new to the user depending 
on the work task at hand. As for Satisfaction (see table 4), there is a significant differ- 
ence between the editorial categories. In the Feature group, a majority of the respon- 
dents were satisfied with their search results for both work tasks, while in the other 
categories a majority were only satisfied with the real work task. 
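The pattern described for Satisfaction can be read directly from the counts in Table 4. The Python sketch below computes the share of fully satisfied respondents per editorial category and task type; the data structure and function names are hypothetical, and counting only "Yes" answers as satisfied is an assumption of the sketch.

```python
# (Yes, In part, No) counts per editorial category and task type, from Table 4.
TABLE_4 = {
    "Feature": {"real": (6, 0, 0), "simulated": (4, 0, 2)},
    "Specialists": {"real": (4, 0, 1), "simulated": (1, 2, 2)},
    "Domestic": {"real": (6, 0, 3), "simulated": (2, 2, 5)},
}


def share_satisfied(counts):
    """Proportion of respondents who answered 'Yes' (assumption: 'In part' is not counted)."""
    yes, in_part, no = counts
    return yes / (yes + in_part + no)


def majority_satisfied(counts):
    """True when more than half of the respondents answered 'Yes'."""
    return share_satisfied(counts) > 0.5


for group, tasks in TABLE_4.items():
    for task, counts in tasks.items():
        print(f"{group:12} {task:10} {share_satisfied(counts):.2f}")
```

Across all 20 respondents, 16 (0.80) were satisfied with the results for the real work task and 7 (0.35) with the results for the simulated work task.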

5 Discussion 

5.1 Conclusions of the System Evaluation 

The effectiveness of the system in responding to the search query was measured using 
the Precision and RHL index. Our results show that NewsLink often performs poorly 
on relevance values for the documents. On the other hand, we found the system to be 
effective from a user perspective, as shown by the results for Satisfaction. Findings 
from earlier research indicate that relevance rankings by system and by users often do 
not correspond [17, 18]. This is an important aspect of system evaluation from a user 
perspective and especially interesting since the evaluated system is manually main- 
tained, thus requiring higher maintenance costs. 

Our results revealed a discrepancy between the results for the system-oriented and 
the user-oriented measures. Some of the respondents retrieved only one document, but 
this was the one relevant document required, while others wanted to make sure that 
there were no earlier articles on the subject in GP, (i.e. for them no retrieved relevant 
documents was optimal). Both examples resulted in high satisfaction but very poor 
Precision and RHL-index values. In addition, the involvement of real end users led to 
dynamic information needs (cf. [11, 19]) also affecting the values for Precision and 
RHL. The Novelty measure shows that when the respondents retrieved two or more 
documents about the same subject, only the first one was judged relevant (cf. [20]). 
Measures of effectiveness, such as Precision and RHL index, judge such documents 
equally relevant. 

Our findings show that although NewsLink is important for the journalists in their 
daily work, it does not function as their only or primary information source. A major- 
ity of the respondents stated oral sources as being the most useful, turning to 
NewsLink mainly for background material. There was a strong correlation between 
the journalistic task at hand and the use and role of the system. The Local news group 
uses NewsLink most and, contrary to the other groups, values oral sources more than 
the Internet. Also in this respect our findings correspond to other studies on journal- 
ists’ seeking behaviour (cf. [21]). 

It is of major importance that the system is well suited to its (potential) users. Oth- 
erwise, it may as well be useless, no matter how well it performs according to effec- 
tiveness. The system evaluated in this study, NewsLink, seems to be adapted for the 
journalists working at Göteborgs-Posten. The system is experienced as effective from 
the users’ point of view and judged as an important information source among others. 
Our general conclusion is that the study reported here highlights how difficult an 
evaluation of an operational IR system is. Our results for Satisfaction and Novelty 
confirm how important it is to consider qualitative measures as well as quantitative 
performance measures such as Precision, especially when it comes to evaluations in 
real life settings. Even when NewsLink was performing poorly, its real users were 
satisfied with it. Whether the reason for this is that the users had become used to the 
system or that the system is performing “well enough” or something else, it is still 
clear that knowing the users and their context is a key factor in successful system 
design. 

5.2 Conclusions of the Method Evaluation 

The methodological objective of the study was to test Borlund’s “IIR evaluation 
package” in a real life setting. We found that the simulated work task situation func- 
tioned well in many respects. It did not control the participants in detail. Less than a 
third of the words used in the queries were directly from the description of the simu- 
lated work task situation. They used “terror” in different combinations and “Septem- 
ber”. However, we became aware that composing a simulated work task situation that 
offers a sufficient level of reality for all participants must be done with great care. 
Moreover, the importance of using at least one real work task cannot be overstated. 
Familiarity with the task requirements and higher motivation led to better values 
for Precision, RHL index and Satisfaction for the real work tasks than for the simulated 
work task. 

The performance measures we used were chosen because they consider both the 
system and its users. We found our results for RHL and Precision to be very similar. 
The reason for this is that both measures depend on the values of the relevance 
assessments. In spite of this, they both provide complementary information. Our ex- 
ample (3) shows that the RHL measure can indicate which system is most effective 
even where the Precision values are identical. When only the top fifteen documents 
are taken into consideration, the difference in effectiveness will be marginal, but 
when it comes to larger amounts of documents, the placing of relevant documents 
may be of great importance. Precision is a measure of the proportion of retrieved 
documents that are relevant, but it reveals nothing of where the relevant documents 
can be found. This is highly relevant for the users though, as they seldom are inter- 
ested enough to look through large amounts of documents. Since the RHL-indicator 
may be somewhat misleading it ought to be normalised into an RHL index value to 
produce more easily interpreted and comparable results. 

Applying a holistic approach to IR evaluation has major advantages. Several fac- 
tors in our study prove the importance of involving the real users of the system. Re- 
spondents were satisfied with their search results despite poor results according to the 
effectiveness measures, mainly because the documents, instead of being judged in 
isolation, were valued in an authentic search situation. This points to the fact that 
using Precision and other system oriented measures in this kind of evaluation is de- 
batable. Journalists need few but highly relevant documents placed in top positions in 
the ranked list, thus enabling swift access to them. Since they constantly work under 
time pressure it is difficult, if not impossible for them to look through huge amounts 
of documents. These working conditions are now becoming increasingly common 
within several professions, which means that evaluations in general should include the 
users of the system and user-oriented measures as a complement to the system- 
oriented ones. 

We were also able to determine the type of information that was retrieved from the 
system. These kinds of results are novel in evaluation context, but nonetheless very 
important for making a correct evaluation of an information system. If the users de- 
liberately use different sources for different kinds of information as indicated by find- 
ings of Bystrom [12], the system may still be highly satisfactory to its users even if it 
only provides them with certain kinds of information. Calculating Precision from 
dynamic situational relevance assessments made on a non-binary scale improves the 
significance of the measure, but Precision values can not tell whether or not the users 
are satisfied and have achieved work task fulfilment. Combining user- and 
system-oriented measures and adding post-search interviews enables a fuller and more 
accurate picture of the system, its users and context. 

To sum up, we conclude that the evaluation method used in this study is well suited for 
evaluations of operational systems, covering system, user and context. It aims to pro- 
vide an overall view of how well the system suits its users and the system’s role 
among other available information sources. The approach as such has functioned well 
and provided a solid methodological base. The measures used have yielded valuable 
information about the system from the users’ point of view. These different measures 
functioned well and generated different types of information that complement each other. 
We look forward to developing and testing the methodology in additional studies. 


Abstract. This paper focuses on fiction electronic books and their usability. 
Two complementary studies were drawn together in order to investigate 
whether fiction e-books can successfully become part of people’s reading hab- 
its: the Visual Book project, which found that electronic texts which closely re- 
semble their paper counterparts in terms of visual components such as size, 
quality and design were received positively by users, and the EBONI Project 
which aimed to define a set of best practice guidelines for designing electronic 
textbooks. It was found that the general guidelines for the design of textbooks 
on the Internet that have been proposed by the EBONI project can also be ap- 
plied to the design of fiction e-books. Finally, in terms of the electronic produc- 
tion of fiction e-books, this study suggests that concentrating on the appearance 
of text, rather than the technology itself, can lead to better quality publications 
to rival the print version of fiction books. 



1 Introduction 

This paper describes a study into the usability of fiction e-books while verifying the 
applicability/portability of some of the findings reported on educational e-books 
across literature genres. 

The study is based on two relevant projects: 

• The Visual Book project [1], which investigated how to produce better quality 
electronic publications by focusing on the impact of appearance of information, 
and 

• The EBONI project [2], which investigated the importance of considering the 
user in the design of electronic books. 

Both studies focused on educational material, with the Visual Book restricted to 
scientific texts and EBONI considering e-textbooks across disciplines in higher 
education. Previous work suggests that consulting an e-book for study or reference is 
a very different experience to reading for pleasure, in which the process is much 
closer to that of reading a paper book [3]. This paper reports on an experiment 
devised to study whether principles for designing the visual components of 
electronic textbooks can be transported into the fiction genre. Will a fiction e-book 
presented in accordance with EBONI’s design guidelines increase user satisfaction 
and usability? 

1.1 The Fiction e-Books 

In order to determine whether the EBONI project’s guidelines can further the pur- 
poses of the Visual Book project, a fiction e-book in three different formats was con- 
sidered: The Adventures of Gerard, by Sir Arthur Conan Doyle. 

The three versions of the specific electronic book were chosen in terms of format, 
functionality and availability. The book is available free on the Internet, and users are 
allowed to include it in personal Web sites. 

The three versions of the e-book were evaluated with respect to usability and sub- 
jective satisfaction issues: 

• Scrolling Book 

• Portable Book with software applications (Adobe Ebook Reader, PDF) 

• Portable Book with software applications (Microsoft Reader) 

The purpose was to determine whether the PDF and the MS Reader versions of the 
text, which share many of the characteristics recommended in EBONI’s guidelines for 
designing e-books, would perform better than the “Scrolling” version of the fiction e-book. 

The aims of this study, therefore, were: 

• To study whether the presentation of a fiction book in electronic format that 
shares the EBONI project’s guidelines in terms of visual components (such as 
size, quality and design) increases satisfaction and usability. 

• To compare the results of this study with the results of the EBONI project which 
focused on the design of learning and teaching material on the Internet. 

Note that it was not within the scope of this study to examine ebook hardware (port- 
able ebook) issues. These were previously investigated by EBONI and the findings 
have been reported [4, 5]. 



1.2 The Visual Book and the Web Book Experiments 

The Visual Book experiment, conducted between 1993 and 1997, was part of a more 
general project called SuperLibrary and highlighted the importance of appearance in 
the design of electronic textbooks [1]. 

The Visual Book project started from the observation that, 

the appearance of information contributes positively to its overall value and 
that because there is an almost infinite number of possible ways to represent 
various kinds of information, it is very important to find the one which is go- 
ing to be the most effective and which conveys as much of the value of the 
original information [1]. 

The idea was that, because people know how to read books and use tables of contents 
and indexes, maintaining the same model on screen would facilitate access to elec- 
tronic information. The experiment concluded that the book metaphor is an important 
aspect in defining guidelines for the design of electronic books. 

In general, the results of the evaluation of the Visual Book, which were supported 
by the findings of a similar project, the Hyper-Book [6], showed that the book meta- 
phor was both accepted and understood by its evaluators. Furthermore, the results 
highlighted the need for a new role in electronic publishing: “the designer of 
electronic books, as the person in charge of final appearance” [7]. 

The Web Book project investigated this issue with respect to the production of 
books on the Web [7]. Despite the fact that the experiment was conducted on a small 
scale, the results indicated that making texts on the Web more scannable (according to 
Morkes and Nielsen’s guidelines [8]) has a positive effect on their usability. 

1.3 EBONI Project 

EBONI built on the work of the Visual Book and the Web Book. The aim of the pro- 
ject was to compile a set of guidelines for the publication of electronic textbooks, 
reflecting the usability requirements of the UK higher education community [9]. 

The following were among the evaluations which applied a specific methodology 
developed for EBONI [2]: 

• An evaluation of three textbooks in psychology, all of which had been published 
on the Internet by their authors. The three textbooks were evaluated by second, 
third and fourth year psychology undergraduates in UK higher education. 

• An evaluation of Hypertext in Context by McKnight et al. [10]. The textbook was 
compared in three formats: print, the original electronic version on the Web, and 
a second electronic version revised according to Morkes and Nielsen’s guide- 
lines for “scannability” [8]. 

• A comparison of three electronic encyclopedias: Encyclopedia Britannica, The 
Columbia Encyclopedia, and Encarta. 

• A comparison of a title in geography by second year geography undergraduates 
that is available in three electronic formats: MobiPocket Reader, Adobe Acrobat 
Ebook Reader, and Microsoft Reader. 

• A study into usability issues surrounding portable electronic books. Lecturers 
and researchers at the University of Strathclyde evaluated five devices in order to 
determine which elements enhance and which detract from the experience of 
reading an electronic book. 

The results of these studies were then re-elaborated to form a set of Electronic Text- 
book Design Guidelines (http://ebooks.strath.ac.uk/eboni/guidelines/). 



2 Methodology 

In order to be able to compare results of the fiction e-book study with a previous cor- 
pus of findings, the evaluation methodology developed by the EBONI project [2] was 
adhered to as closely as possible, and the content of the questionnaires used in this 
study was replicated with few adjustments. 

Since this research involved a fiction book, rather than the textbooks used in the 
EBONI experiments, a number of differences in procedure were observed. 

First of all, the specific nature of fiction made completing a series of tasks while 
reading a book impractical, and participants were left to decide how they would ex- 
plore the book - whether they would simply choose to read a chapter, browse through 
it, or even read all the chapters. Participants had only to complete three questionnaires 
(one for every version of the book) after reading the book, so that their responses 
were informed and based on experience. 

Further, several of the items in the questionnaire used in the EBONI experiments 
were omitted due to their inapplicability in assessing the usability of a fiction elec- 
tronic book. For example, the words “concise”, “frustrating”, “interesting”, “like- 
able”, and “useful” were removed and replaced with direct questions such as: “Was 
the text easy to read?”, “Was the book easy to navigate?”, “How frustrated did you 
feel by the appearance of the book?”, and “What did you like or dislike about reading 
the specific version of the e-book?” 

2.1 Participants 

Twenty-five subjects participated in this experiment. They comprised respondents to 
emails sent to the wider public and to lecturers and postgraduate students in 
Computer and Information Science at Strathclyde University. A level of Internet 
experience was assumed, because participants were contacted by email and the experiment, 
which involved reading a book and filling out a form, was conducted entirely on the Web. 

2.2 Selection of Material 

The text selected for use in the study was The Adventures of Gerard by Sir Arthur 
Conan Doyle. This was chosen not only because it was one of the limited titles avail- 
able in the desired three versions (Scrolling Book, PDF format, and MS Reader), but 
also because it was thought that the story would provide enough interest to ensure that 
participants would enjoy reading it. Three versions of the text were used in the study: 

Scrolling Book: Provided by Project Gutenberg (http://gutenberg.net/index.html), the 
book is very simple in format and is presented according to a scroll metaphor. The 
first part of the text includes information about Project Gutenberg and copyright is- 
sues. The main part of the text contains the book by Sir Arthur Conan Doyle. The e- 
book is not divided into pages and the text scrolls almost without any physical limita- 
tion. The information is presented according to a book style hierarchy, made of chap- 
ters, subchapters, paragraphs and sections, but everything is displayed on the same 
page. 

The next two versions are software applications, also known as e-book readers. 
These provide extra functionalities such as annotations, bookmarks, different fonts 
and colors to help users in their reading/scanning process. 

Adobe Ebook Reader: This version of the book was provided by Nalanda Digital 
Library in India (http://www.nalanda.nitc.ac.in/index.html). It has the typical PDF 
format and provides a series of functionalities to the reader such as bookmarks, 
thumbnails, and the ability to change fonts and size. The text has the physical look of 
a book, with a single numbered page appearing on the screen at any time. More spe- 
cifically, some of the functionalities that it offers are as follows: 

• Adjust size: allows readers to adjust the size of the book (actual size, fit in win- 
dow, or fit width). 

• Bookmarks: readers can mark the last part of the book that was visited and return 
to it later. 

• Find: helps readers to search for words or phrases quickly and easily. 

• Go to previous/next view: takes readers to the part of the book that was visited 
previously. 

• Graphics/text select tool: allows readers to select and process specific parts of 
the text. 

• Move first/previous page: readers can navigate through the pages themselves. 

• Print: readers can download and print the e-book. 

• Rotate text: allows readers to change the orientation of the text on the screen. 

• Thumbnails: allows readers to view all pages of the book on the left side of the 
screen. 

• Zoom in/out tool: readers can use this facility to get a close-up view of text and 
graphics. 

MS Reader: The third version of the book was provided by the Virginia Digital Li- 
brary (http://etext.lib.virginia.edu/ebooks/Plist.html) and can be read with Microsoft 
Reader. This is the most complicated version of the book and offers a plethora of 
functionalities. Some of these functionalities are described below: 

• Clear type: improves the clarity of text on standard LCD screens, delivering a 
print-like display. 

• Navigation: the “Riffle Control” allows readers to easily turn pages or skip to 
another page in the book using their keyboard or mouse. 

• Font size: allows readers to increase or decrease font size from the settings page. 

• Find: helps readers to search for words or phrases quickly and easily. 

• Pan and zoom graphics: readers can use this facility to get a close-up view of 
graphics and pictures. After zooming in, they can pan around the graphic to take 
a closer look at any area. 

• Bookmarks: always appear in the page margin and they are filled when readers 
are on the bookmarked page, otherwise only their outline is shown. Readers can 
also change the color of bookmarks to suit their preferences. 

• Library: all the books and other content that readers acquire are stored in the li- 
brary. Readers can organise items in their library to appear by title, author, last 
read, e-book size, or date acquired. 

• Notes: readers can use their keyboard to add written comments to any page. 

• Drawings: readers can choose from a wide range of colors to circle words, un- 
derline text, or add any other type of mark to a page. 

• Annotations: personal annotations - highlights, bookmarks, notes, and drawings 
- are stored in one location and can be easily organised. 

• Highlights: readers can call attention to a word or passage by highlighting it with 
a stroke of the mouse, as they would do with a highlighter in a paper book. 

• Dictionary: allows readers to look up word meanings through the built-in 
Lookup functionality. 

2.3 Procedure 

Every stage of this experiment was carried out over the Internet. Emails inviting 
participation in the study were sent to the wider public, to students who had studied 
Information and Library Studies at Strathclyde University, and to lecturers in the 
Department of Computer and Information Sciences at the University. The emails 
explained briefly the purpose of the study, told potential respondents it would involve 
visiting a Web site, reading three versions of a fiction electronic book and answering 
some questions, and directed them towards a URL to begin the survey. 

On visiting the URL, participants were first asked to complete some details about 
themselves. They were asked about their age, gender and occupation and, to judge 
their degree of familiarity with fiction e-books, the following questions: 

1. Prior to this study, had you read a fiction e-book? 

2. If you are not currently using fiction e-books is it because: 

• generally you have not considered them 

• you consider that they offer no advantages over print 

• you think that you would experience access problems 

• you think that they are difficult to use 

• you think that they are difficult to find 

To minimise learning effects, participants were then asked to read the three versions 
of the book in any order. In the Subjective Satisfaction questionnaire participants 
were asked to describe how easy it was to learn to use the book, read through it, and 
navigate. The first part asked them about specific aspects of working with the book, 
while the second part asked them to rate a list of adjectives according to how well 
they describe it. Finally, respondents were asked to add any comments about the ex- 
perience of reading a fiction book in an electronic format, and whether they would 
read fiction e-books in the future. 



2.4 Measurement of Results 

The subjective satisfaction index was the mean score of the following two indices: 

• Ease of use. This part included four questions: “Compared to what you expected, 
how quickly did you learn to use the e-book?”, “Was the text easy to read?”, 
“Was the book easy to navigate?”, and “How frustrated did you feel by the ap- 
pearance of the book?” 

• Quality. The first question consisted of four adjectives that described the book: 
annoying, engaging, helpful and unpleasant. The second question asked readers 
to rate the various functionalities offered by each version of the book on a scale 
from “very helpful” to “not very helpful”. 

To judge whether participants liked or disliked reading an e-book, they were asked to 
summarise their views by answering the following questions: 

• What did you like about reading the specific version of the book? 

• What did you dislike about reading the specific version of the book? 

Finally, respondents were asked to respond with a simple “yes” or “no” to the follow- 
ing question: “Would you read a fiction e-book in the future?” 



3 Results 



Results are presented in terms of ease of use and quality in Table 1. 

Table 1. Mean scores for the two major measures, on a scale from 1 to 10 

Version               Ease of use   Quality 
Scrolling Book        6.9           5.3 
Adobe Ebook Reader    7.1           6.8 
Microsoft Reader      5.8           5.8 



Next, the overall subjective satisfaction score for each version of the fiction e-book 
was calculated by averaging the mean scores of the two measures (ease of use and 
quality). These results are presented in Table 2. 



Table 2. Mean scores for the two major measures, and overall usability 

Version               Ease of use   Quality   Overall Subjective Satisfaction 
Scrolling Book        6.9           5.3       6.1 
Adobe Ebook Reader    7.1           6.8       7 
Microsoft Reader      5.8           5.8       5.8 
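
The calculation behind the overall score can be sketched as follows. This is a minimal illustration, assuming that the overall subjective satisfaction score is the unweighted mean of the two measures, which is consistent with the values in Table 2. The variable and function names are illustrative only and are not part of the study's materials.

```python
# Mean scores reported in Table 1 (scale from 1 to 10).
scores = {
    "Scrolling Book": {"ease_of_use": 6.9, "quality": 5.3},
    "Adobe Ebook Reader": {"ease_of_use": 7.1, "quality": 6.8},
    "Microsoft Reader": {"ease_of_use": 5.8, "quality": 5.8},
}


def overall_satisfaction(ease_of_use, quality):
    """Assumed formula: the unweighted mean of the two measures."""
    return (ease_of_use + quality) / 2


# Overall score per version, and the versions ranked from highest to lowest.
overall = {name: overall_satisfaction(**m) for name, m in scores.items()}
ranking = sorted(overall, key=overall.get, reverse=True)

for name in ranking:
    print(f"{name}: {overall[name]:.2f}")
```

Under this assumption the Adobe Ebook Reader score works out at 6.95, which Table 2 reports as 7, and the Scrolling Book (6.1) ranks above Microsoft Reader (5.8).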



3.1 Users’ Comments 

In accordance with the results of the questionnaire, users’ comments about the Scroll- 
ing version of the book were generally negative, while comments about Adobe Ebook 
Reader and Microsoft Reader were more positive. 

Users of the Scrolling version liked the fact that the book was easy to download 
without first having to install any special programs. One user noted that it was 
“fairly easy to access, and because it was plain text, it would be very easy to copy and 
paste sections”. Overall, respondents were satisfied with the fact that the text was easy 
to download and quite simple in format. However, they elaborated more on their an- 
swers when they were asked to describe what they disliked. 

A lack of user-friendliness was commented on twice, with one user complaining 
that he/she found it “very user-unfriendly because it was hard to work out where the 
book actually started - there was a lot of additional information at the beginning 
which I was not interested in reading” and another stating that, “Scrolling made it 
hard to read easily. The font and layout was also a bit unfriendly”. One user described 
the book as “monotonous and quite confusing”, and another stated that it was “not the 
most inspiring format, and having to scroll down through the whole document instead 
of jumping to a particular chapter was annoying”. Another participant reported that, 
“the scrolling sometimes jumped more lines that I wanted so I had to go back to read 
the start of each paragraph”. Two participants were dissatisfied with navigation: one 
reported “too much preamble at beginning. Not able to see how far you have got, i.e. 
pages read and pages to go”, while another noted, “the typeface is unattractive and the 
page looked crowded. It felt as if you could easily get lost reading it (especially if you 
were a bit tired) because of the type of text - unclear layout”. 

The Adobe Ebook Reader version elicited more positive responses. “Clear, well-spaced 
typeface, easy to resize and attractive to look at”, wrote one participant, and another 
commented that the book was “colourful, interesting, easy to use and quick to navi- 
gate”. One user reported that it was “much more attractive than the Scrolling version”. 
Another stated, “It was like reading a book, unlike the scrolling version, you don’t 
have to move your hand until you find the right page”. A third user reported that it 
was “Much more user-friendly in comparison to the first version. Much more ‘book- 
like’ format/layout”. Two participants liked the appearance of the book, with one user 
stating, “I liked the fact that it looked like a book. It was easy to read and you could 
make changes to its appearance”, and another observing that “The design of the e- 
book actually looked like the pages of a real book, which made it much more pleasant 
to read”. 

Seven out of the 25 respondents commented that there was nothing they disliked in 
the Adobe Ebook version, with one user noting, “I don’t think there was anything I 
didn’t like. I am a great fan of .pdf files” and another reporting, “[I disliked] nothing, 
it was very pleasant to read”. However, four participants commented on the fact that it 
takes a while to download the text. A couple of participants also complained about the 
highlight facility: “If you want to underline, that’s not really possible with a PDF 
file”. There were also a few comments about the overall appearance of the book. One 
participant, although he/she liked the particular version, still did not like the represen- 
tation of the book in an electronic format: “Did not dislike anything, but would still be 
unlikely to read it. Still prefer an actual physical book”. “I disliked the fact that al- 
though it looked like a book it did not feel like one”, wrote another user. 

Users also made some positive remarks about Microsoft Reader. Six participants 
commented on navigation and the extra features offered by the Microsoft Reader; one 
wrote, “it was easier to navigate than the other two formats” and another noted, “I 
liked the format of the text and the extra features the MS-Reader offers”. A couple of 
readers liked the fact that they could interact with the book, and one commented, “I 
liked everything about it. It looked and felt like a real book. I could not believe that 
you could draw inside the text, and that you could also make notes or highlight”. 

Even though the majority of participants liked the overall appearance of the text 
and the extra features, they were very critical when they had to add their own remarks 
on what they disliked about reading this version. A lack of good navigation features 
was commented on twice, with one user complaining about “lack of icons and not so 
good navigation features as the PDF”. Several users had problems in downloading 
Microsoft Reader or even getting the program to run. “I disliked the fact that I had to 
download a program”, one participant wrote, while another reported, “Sorry, after 
downloading Microsoft Reader I downloaded the book 3 times but still could not get 
it to work - despite help menu: it seems very technical and difficult to use and I have 
given up on it for the time being”. Also, eight participants reported that they would 
prefer not to have to download extra software in order to read the book: “[I disliked] 
having to download specific software. I think e-books will only increase in popularity 
when people can read them with absolutely no extra effort . . . people tend to have a 
low patience threshold when it comes to computers!”. 



3.2 Analysis and Discussion 

The purpose of this study was to explore whether the presentation of a fiction book in 
electronic format that adheres to the EBONI project’s guidelines in terms of visual 
components (such as size, quality and design) increases users’ satisfaction and overall 
usability of the text. 

The results of the experiment have shown that users of Adobe Ebook Reader, 
which adheres closely to the guidelines, reported highest subjective satisfaction. In 
particular: 

• Users of the Adobe Ebook Reader and Scrolling Book reported a higher score 
when it comes to ease of use of the book. 

• Users of the Adobe Ebook Reader and Microsoft Reader reported a higher score 
when it comes to quality of the book. 

• When combined into an overall satisfaction score, Adobe Ebook Reader has the 
highest score. 

However, although Adobe Ebook Reader scored highly, Microsoft Reader was found 
to be more difficult to use than the Scrolling Book, despite also adhering to the design 
guidelines, and this requires further investigation. Participants’ comments and their 
ratings of quality were positive, but they thought it quite difficult to use. By contrast, 
the Scrolling version achieved a high score (6.9/10) for ease of use and, although 
it had the lowest score for quality, it still managed to perform better than 
Microsoft Reader in the overall subjective satisfaction score. 

It is unlikely that users read the entire book in three different formats, and this may 
have affected scores. Users are already familiar with the scrolling metaphor and with 
PDF, but are less familiar with ebook software. When using the books for only a short 
time, it seems likely that this unfamiliarity with Microsoft Reader software may have 
had a negative impact on reported ease of use. 

In terms of the electronic production of fiction books in general, this study pro- 
vides an example of how concentrating on the appearance of text, rather than the 
technology itself, can lead to better quality publications to rival print versions. 

Therefore, the general guidelines for the design of textbooks on the Internet as pro- 
posed by the EBONI project can also be applied to the design of fiction e-books. In 
both studies, analysis of the results has indicated that adherence to the book metaphor 
increases users’ subjective satisfaction and overall usability of the book. 

In particular, participants confirmed the importance of the following guidelines for 
the design of fiction e-books: 

Tables of Contents. Tables of contents are an essential feature in both print and elec- 
tronic media, used by readers to skim the contents of an unfamiliar book to gain an 
idea of what can be found inside. They also provide the reader with a sense of struc- 
ture, which can easily be lost in the electronic medium, and can be an important navi- 
gation tool. In the words of one participant, “I liked that it looked like a usual book 
and the fact that I could find any chapter I wanted easily simply by using the table of 
contents”. 

Fonts. Fonts should be large enough to read comfortably for long periods of time. If 
possible, readers would like to choose a font style and size to suit their individual 
preferences, thereby satisfying the needs of those with perfect vision and those with 
low vision or reading difficulties. Fonts which include specific special characters such 
as italics should be used, and a colour that contrasts sufficiently with the background 
should be chosen. As one participant in the experiment noted, “change of font is a 
welcome facility”. 

Search Tool. Tables of contents offer access points for browsing. These can be 
supplemented by search tools, which provide another method of finding information in an 
electronic text and are appreciated by readers. A choice of simple searches (searching 
the whole book, a chapter, or a page for a keyword) should be offered to suit different 
levels of reader. As one participant noted, “I did not like the fact that I could not 
perform a search in the Scrolling version of the fiction book”. 
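
One way to offer the three search scopes named above is sketched below in Python. This is only an illustration: the nested-list book structure, the `search` function and the sample sentences are hypothetical, and are not taken from any of the e-book readers evaluated in the study.

```python
def search(book, keyword, chapter=None, page=None):
    """Return (chapter, page) index pairs whose text contains the keyword.

    `book` is a hypothetical structure: a list of chapters, each of which
    is a list of page strings. `chapter` and `page` are optional 0-based
    indices that narrow the search; `page` counts pages within a chapter.
    The match is a case-insensitive substring test.
    """
    needle = keyword.lower()
    hits = []
    for c, pages in enumerate(book):
        if chapter is not None and c != chapter:
            continue  # outside the requested chapter
        for p, text in enumerate(pages):
            if page is not None and p != page:
                continue  # outside the requested page
            if needle in text.lower():
                hits.append((c, p))
    return hits


# Hypothetical sample content: two chapters of two pages each.
book = [
    ["The brigadier drew his sabre.", "He rode on through the night."],
    ["The Emperor sent for the brigadier.", "A sabre lay on the table."],
]

whole_book = search(book, "sabre")                         # every chapter and page
one_chapter = search(book, "sabre", chapter=1)             # second chapter only
one_page = search(book, "brigadier", chapter=1, page=0)    # a single page
```

Leaving both optional arguments unset searches the whole book, so a reader who wants the simplest search never has to choose a scope.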

Navigation Icons. Participants strongly valued the fact that they could make use of a 
set of navigation buttons that enabled them to move forward and back in the book, 
skip chapters, and choose particular parts of the text. However, the function of any 
navigation icons should be explicit. As one participant noted, “the navigation menu 
was sometimes difficult to follow”. 

Bookmarks. Participants expressed a desire to have bookmarking facilities, which 
they would like to be straightforward and quick to use. 

Highlight Facility. Participants also expressed a desire to have a highlight facility, 
which is not always provided in e-books. Readers appreciated the highlight facility 
available in the Microsoft Reader version of the book because it allowed a degree of 
interaction. In the words of one reader, “[I] liked the highlight and drawing facilities 
offered by the MS Reader - allowed you to interact with the book. Made it feel more 
real”. 

Participants also found it difficult and unpleasant to read long streams of text on 
screen. “Not the most inspiring format, and having to scroll down through the whole 
document instead of jumping to a particular chapter was annoying” one participant 
noted, while another one reported “I disliked the feeling that there were no pages and 
the continuous format was very tiring”. These comments were made about the Scrolling 
version of the book, and they illustrate that it is important to divide the book into 
short chapters, with short pages and short paragraphs. 

Readers gain a sense of their place in a printed book via the page numbers and by 
comparing the thickness and weight of the pages read against the thickness and 
weight of the pages still to be read. Participants complained that they did not have this 
option while reading the Scrolling version of the book and one of them noted “Too 
much preamble at beginning. Not able to see how far you have got, i.e. pages read and 
pages to go”. 

Participants also expected the background of the book to be in colour. As was sug- 
gested in EBONI, colour makes the book more appealing and interesting. Readers in 
this experiment preferred to read the book by having more interesting colours in the 
background than grey and black. 



4 Conclusions 

The study described in this paper looked into a specific type of e-book, fiction 
e-books, and provides an indication of future steps which could be taken to make them 
easier and more enjoyable to read, and of course more suited to the needs of the wider 
public. As noted by Rao [11], for e-books to change people’s reading habits, the 
incongruence with user expectations about how books are handled needs to be 
investigated and overcome. 

This experiment was conducted on a small scale, and so the results are just indica- 
tions of the potential effect of altering the appearance of fiction e-books to make them 
more attractive and practical to use. Nonetheless, these indications are positive and in 
tune with the findings of previous studies, which formed the background to the ex- 
periment. 

Another issue is that the experiment took place over the Internet and thus it could 
be assumed that all participants are computer literate and have at least a basic knowl- 
edge of how to use the Web. However, it would be interesting to carry out a study 
with participants who have little or no computer experience to determine to which 
version of electronic books they can adapt more easily. Thus, researchers and devel- 
opers of electronic books will have a clearer view about the needs of the wider popu- 
lation and not only of the academic community. Finally, it would be meaningful to 
allow users to pick their favourite titles and provide them with a real choice so that 
their reactions and motivations would be more realistic. 

Indeed, most academic libraries already include a certain number of e-books in 
their stock, and some public libraries are experimenting with offering e-books to their 
readers by circulating dedicated portable readers. It therefore seems that much re- 
search on the introduction and use of electronic books could be undertaken within 
libraries. 



Abstract. Digital library programmes often seek to provide interoperability 
through use of open standards. In practice, however, deployment of open stan- 
dards in a compliant manner is not necessarily easy. The author argues that a 
strict checking regime would be inappropriate in many circumstances. The au- 
thor proposes deployment of quality assurance (QA) principles which provide 
documented policies on the standards and best practices to be implemented and 
systematic procedures for measuring compliance with these policies. The paper 
describes the work of the QA Focus project which has developed a QA meth- 
odology to support JISC’s digital library programmes. A summary of the appli- 
cation of the methodology to support selection of standards and the deployment 
of deliverables into service is given. The author argues that similar approaches 
are needed if we are to provide interoperability across digital library pro- 
grammes. 



1 Introduction 

The need for open standards in order to provide interoperable digital library services 
is widely acknowledged. In addition to use of open standards there is also a need to 
make use of agreed best practices in the provision of digital library services. 

Although such principles are widely accepted in the digital library community, in 
practice appropriate standards and best practices are not always used. This can happen 
for a number of reasons, some of which are legitimate (immaturity of open standards, 
a lack of tools, etc.). However, a failure to use appropriate solutions may be due to 
inertia on the part of the developer, a failure to understand the need for open stan- 
dards, a failure to appreciate appropriate architectures for open standards, lack of 
agreement on a definition of ‘open standards’ or a mistaken impression that open 
standards are being used. There is therefore a need to provide a model which seeks to 
exploit the potential of open standards, but is capable of addressing the challenges this 
can provide in a flexible manner. 

This paper reviews the quality assurance methodology and support materials 
developed by the JISC-funded QA Focus project which aims to ensure that JISC’s 
digital library programmes are functional, widely accessible, interoperable and can be 
deployed easily into a service environment. Particular emphasis is given to the 
application of the quality assurance framework in the selection of standards and the 
deployment of project deliverables into a service environment. 

2 Traditional Approach to Support

2.1 Background 

NOF-digitise is a digital library programme in the UK supported by public funding of 
about £50 million. The programme funds universities, museums, libraries, etc. to 
digitise materials from their collections and archives in order to make this cultural 
heritage available online. 

Although organisations with proven expertise in digitisation work were funded, a 
number had little experience of large-scale digitisation activities. The programme 
provided a valuable opportunity for public sector bodies to enhance their expertise in 
this area; expertise which would be valuable in supporting in-house development 
activities. 

The importance of open standards was emphasised from the start. It was a require- 
ment that the projects addressed the need for potential reuse of the resources. In order 
to support over 150 projects the NOF-digi Technical Advisory Service (NOF-TAS) 
[1] was established. Early activities of NOF-TAS included producing the NOF-digi
Technical Standards and Guidelines document [2] and organising workshops to sup- 
port the projects. This was complemented by an email support list and a series of 
FAQs. 

NOF-TAS was not responsible for monitoring projects’ compliance with the stan- 
dards. This work was carried out by BECTa. NOF-TAS worked with BECTa in de- 
veloping a self-assessment reporting procedure. A reporting template was used to 
allow projects to report on compliance with standards. Projects were expected to 
document areas in which they were failing to make use of appropriate open standards. 
It was recognised that there were areas in which open standards would be difficult to 
implement: e.g. areas in which the standards were immature, with limited availability 
of authoring tools and poor support for viewers. For example in the area of synchro- 
nised multimedia the preferred open standard is SMIL (Synchronized Multimedia 
Integration Language); however this format is not yet ready for mainstream use. The 
proprietary alternative many projects preferred was Flash. 

In response to such challenges the reporting procedure required projects to docu- 
ment: 

• Reasons why they intend to make use of a proprietary format. 

• Reasons why open standards could not be used. 

• The scope of their proposed use of proprietary solutions. 

• Migration strategies to open standards if they become more readily available. 

• Indications of funding issues to support the migration. 

It was permissible, for example, to develop an interactive game using Flash; how- 
ever it would not be permitted to produce an entire Web site in Flash or use Flash to 
provide site navigation or to use it simply because of availability of in-house expertise 
in the format. The process is described in a NOF-TAS FAQ [3]. 



2.2 Limitations of This Approach 

Although the work of NOF-TAS was highly appreciated by the projects and the fun- 
ders the support model used by the service did have limitations: 

• There is a danger that projects may regard open standards as something imposed
upon them. Organisations which carried out the project work will not necessarily 
have embedded a standards-based approach throughout their organisations. 

• Projects may regard checking compliance as something carried out by external 
bodies and may not have developed in-house checking procedures. 

• A formal compliance checking regime may not be well-suited in other develop- 
ment environments. 

• It has not been possible to maintain the standards document, support materials, 
etc. following the end of the project funding. 



3 The QA Focus Approach 

QA Focus has been funded by the JISC to support JISC’s digital library programmes. 
QA Focus began its work in January 2002 with funding initially for two years (subse- 
quently extended by 7 months). QA Focus, along with NOF-TAS, is provided by 
UKOLN and the AHDS (Arts and Humanities Data Service). However QA Focus 
takes a different approach to NOF-TAS. Rather than providing technical support di- 
rectly to projects QA Focus has developed a quality assurance methodology to be 
deployed by the projects themselves. This approach is based on self-assessment; 
unlike the NOF-digi programme no compliance checking is provided by third parties. 
The quality assurance methodology is described in section 4. 

The QA framework is complemented by its support materials consisting of briefing 
documents and case studies, together with an online ‘toolkit’. Over 60 briefing docu- 
ments are available which provide focussed advice in various technical areas. The 
documents provide advice on why particular standards are needed, the advantages and 
disadvantages of various implementation approaches, common problems and ap- 
proaches for ensuring compliance with standards or best practices. 

The case studies, which are written by projects themselves, help in community- 
building by allowing projects to share implementation experiences. In order to avoid 
projects using the case studies as a publicity vehicle a template is provided which 
requires authors to give a description of their project, the problem being addressed in 
the document, the solution used and problems experienced or lessons learnt. 

Other important areas which have been addressed include the selection of stan- 
dards for use by projects and the deployment of project deliverables into service. 
These areas are summarised in sections 5 and 6. 

In addition a series of online toolkits have been developed which provide interac- 
tive self-assessment of use of appropriate standards and best practices and a series of 
surveys of Web sites has been carried out using a variety of testing tools. 

Further information on these resources and on the QA Focus project is available 
from the QA Focus Web site [4]. 



4 QA Methodology 

At the core of the work of the QA Focus project is its quality assurance (QA) meth- 
odology. The work is based on well-established QA principles. We feel that in order 
to provide functional, widely accessible and interoperable deliverables projects need 
to document their technical policies and implement systematic procedures which
ensure that the policies are being implemented correctly. We acknowledge that pro- 
jects often have limited resources and are subject to tight timescales, so we have de- 
veloped a lightweight QA methodology. 

An example of a technical policy is illustrated below. 



Area: Web Access 
Standards: XHTML 1.0 

Exceptions: Resources derived from MS Office applications may not comply with HTML 
standards due to the limitations of Microsoft’s conversion program. 

Implementation Architecture: The Web site uses PHP scripts for processing metadata, 
navigational bars, etc. PHP template files will comply with XHTML 1.0. Content fragments 
will be edited with an XHTML-aware authoring tool. 

Compliance Checking: When pages are created or updated the author is responsible for 
running the ,validate tool to ensure XHTML compliance. A batch check of the Web site
will be carried out quarterly. W3C’s Web Log Analysis tool will run monthly to detect the 
most widely accessed pages which are non-compliant. 

Audit Trails: Reports of the Web Log Analysis tool and batch audits will be kept. 

Addressing Non-Compliance: Page authors are responsible for ensuring their pages are 
compliant. 

Responsibilities: The project manager is responsible for enforcing this policy. 

Fig. 1. Example of a Technical Policy Statement 

This policy and related policies on CSS standards and link checking have been im- 
plemented for the QA Focus Web site. As can be seen such policies need not be oner- 
ous to develop. As well as documenting the standards to be used, the implementation 
architecture is also described. This will help ensure that an appropriate architecture is 
used. The compliance checking regime is documented, and, in recognition of real- 
world complexities, details of permitted exceptions are given. 
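
A policy of the kind shown in Fig. 1 can also be kept in machine-readable form so that undocumented sections are easy to spot. The following Python sketch is illustrative only: the `TechnicalPolicy` class, its field names and the `missing_sections` helper are assumptions made for this example and are not part of the QA Focus materials.

```python
from dataclasses import dataclass, fields


@dataclass
class TechnicalPolicy:
    """Hypothetical record mirroring the headings of the example policy."""
    area: str = ""
    standards: str = ""
    exceptions: str = ""
    implementation_architecture: str = ""
    compliance_checking: str = ""
    audit_trails: str = ""
    addressing_non_compliance: str = ""
    responsibilities: str = ""


def missing_sections(policy):
    """Return the names of policy sections that have been left empty."""
    return [f.name for f in fields(policy) if not getattr(policy, f.name).strip()]


# A partially completed policy; the wording is abbreviated from Fig. 1.
web_access = TechnicalPolicy(
    area="Web Access",
    standards="XHTML 1.0",
    compliance_checking="Authors validate pages when created or updated; batch check quarterly.",
    responsibilities="The project manager is responsible for enforcing this policy.",
)
print(missing_sections(web_access))
```

A check of this kind only confirms that each heading has been addressed; judging whether the content is adequate remains a matter for the project.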

It should be noted that we have implemented a lightweight technique to simplify
compliance checking procedures. In particular, appending ,validate to the end of
any URL on the UKOLN Web site will run the W3C validation program on the page.
Similarly, appending ,cssvalidate will run a CSS validator, appending ,rvalidate
will validate the current page and pages beneath it, appending ,checklink
will run a link checker on the page and appending ,rchecklink will run a link
checker on the page and pages beneath it. This simple interface to a range of testing
services can be implemented using a simple update to a Web server’s configuration
file as described at [5].
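
The URL-suffix interface can be pictured as a mapping from a suffix to a redirect to an external testing service. The Python sketch below illustrates the idea only; it is not the Web server configuration described at [5]. The `SERVICES` table, the `redirect_target` function and the choice of service endpoints are assumptions made for this example, and the recursive HTML validation suffix is omitted because the service behind it is not named in the text.

```python
from urllib.parse import quote

# Hypothetical mapping from URL suffix to a testing service endpoint.
# The endpoints are assumptions; substitute the services actually in use.
SERVICES = {
    ",validate": "https://validator.w3.org/check?uri={page}",
    ",cssvalidate": "https://jigsaw.w3.org/css-validator/validator?uri={page}",
    ",checklink": "https://validator.w3.org/checklink?uri={page}",
    ",rchecklink": "https://validator.w3.org/checklink?uri={page}&recursive=on",
}


def redirect_target(url):
    """Return the testing-service URL for a suffixed URL, or None if no suffix matches."""
    for suffix, template in SERVICES.items():
        if url.endswith(suffix):
            page = url[: -len(suffix)]  # strip the suffix to recover the page URL
            return template.format(page=quote(page, safe=""))
    return None


print(redirect_target("http://www.ukoln.ac.uk/qa-focus/,validate"))
```

Because each suffix begins with a comma, a longer suffix such as ,rchecklink is never mistaken for the shorter ,checklink.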



5 Selection of Standards 

Although the merits of open standards are widely acknowledged deployment of open 
standards is not always easy. There will be times when open standards are immature, 
with limited availability of authoring tools and viewers, or open standards fail to 
reach critical mass. Even in areas in which open standards are mature we do not al- 
ways see open standards being used correctly: for example many Web sites are not 
compliant with the HTML standard. A more complete review of the difficulties
experienced in using open standards within digital library programmes is given in [6].

In light of such issues there is a need for a methodology for selecting standards. It 
would clearly be inappropriate to abandon a commitment to the philosophy of open 
standards, and yet more forceful mandating of use of and compliance with open stan- 
dards may well prove counter-productive. The approach taken by QA Focus is the use 
of a checklist for the selection of standards. We have developed a checklist which
illustrates the range of factors which should be considered when initially selecting the
standards to be used within a project, as illustrated below. 



Table 1. Checklist for Choosing Standards

Area                   Issues
Ownership              Is standard owned by a recognised open standards body?
Development process    Is there a community process for developing the standard?
Availability           Has the proprietary standard been published?
Viewers                Are viewers (a) available for free, (b) available as open source and (c) available on multiple platforms?
Authoring tools        Are authoring tools (a) available for free, (b) available as open source and (c) available on multiple platforms?
Fitness for purposes   Is the standard appropriate for the purpose envisaged?
Resource issues        What are the resource implications in using the standard?
Complexity             How complex is the standard?
Interoperability       How interoperable is the standard?
Service deployment     How easy will it be to deploy the deliverable into service?
Preservation           Is the standard suitable for long term preservation?
Migration              What approaches can be taken to migrating to more appropriate standards in the future?
Measuring compliance   What approaches can be taken to measuring compliance?



We envisage that projects would complete a checklist and use this to aid the
discussion of the standards to be deployed. A record of the issues and decisions made
should be kept. In some cases this record could require approval by an external body;
in others it may be documented in project reports without the need for external approval.
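
One simple way to keep such a record is to store a project's answers against the checklist areas and report which areas have not yet been considered. The Python sketch below is a minimal illustration under that assumption; the `CHECKLIST_AREAS` list follows Table 1, while the `unanswered` helper and the sample answers are hypothetical.

```python
# Checklist areas in the order given in Table 1.
CHECKLIST_AREAS = [
    "Ownership", "Development process", "Availability", "Viewers",
    "Authoring tools", "Fitness for purposes", "Resource issues", "Complexity",
    "Interoperability", "Service deployment", "Preservation", "Migration",
    "Measuring compliance",
]


def unanswered(responses):
    """Return checklist areas with no recorded answer, in checklist order."""
    return [area for area in CHECKLIST_AREAS if not responses.get(area, "").strip()]


# Hypothetical partial record for a candidate standard.
record = {
    "Ownership": "Owned by a recognised open standards body.",
    "Viewers": "   ",  # left blank, so still counted as unanswered
}
print(unanswered(record))
```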



6 Service Deployment 

Many project deliverables will be expected to be deployed into service. However an 
easy transition into service cannot always be guaranteed for a number of reasons: 

• Software, resources or expertise may not be available in the target service. 

• Project deliverables may not fit in with the service’s strategic aims. 

• Concerns over technical quality, costs or legal issues in deploying the deliver- 
ables. 

We recommend that projects should provide information about the technical envi-
ronment, identify potential service environments and have an understanding of issues 
of concern to services. Projects should make their QA policies available to potential 
service providers in order to help address possible concerns and facilitate the deploy- 
ment of project deliverables. 



7 Conclusions 

This paper has argued that in order to enhance the interoperability of digital library 
projects there is a need to deploy quality assurance. QA Focus has developed a prag- 
matic lightweight QA framework which acknowledges the resource and deployment 
pressures faced by projects. 

We feel the approaches described in this paper will be of interest to other digital li- 
brary programmes. We welcome the opportunity to explore possibilities of working 
with other digital library programmes. To support this we are exploring the possibili- 
ties of making our resources available with a Creative Commons licence [7]. 


Abstract. To date, the majority of Web search engines have provided simple
keyword search interfaces that present the results as a ranked list of hyperlinks. 
More recently researchers have been investigating interactive, graphical and 
multimedia approaches which use ontologies to model the knowledge space. 
Such systems use the semantic relationships to structure the assimilated search 
results into interactive semantic graphs or hypermedia presentations which en- 
able the user to quickly and easily explore the results and detect previously un- 
recognized associations. More recently, the proliferation of eResearch commu- 
nities has led to a demand for search interfaces which automate the discovery, 
analysis and assimilation of multiple information sources in order to prove or 
disprove a particular scientific theory or hypothesis. We believe that such semi- 
automated analysis, assimilation and hypothesis-driven approaches represent 
the next generation of search engines. In this paper we describe and evaluate 
such a search interface which we have developed for a particular eScience ap- 
plication. 



1 Introduction 

Traditionally Web search engines have provided simple keyword search interfaces 
which retrieve relevant documents and present the results as a list of hyperlinks 
which the user has to click through one at a time [1]. The Semantic Web [2] is begin- 
ning to enable more interactive, graphical and multimedia search-and-browse inter- 
faces which leverage semantic relationships between retrieved information objects. 
Technologies such as machine-processable semantic annotations, ontologies and 
semantic inferencing rules and engines are enabling automated reasoning about com- 
plex relationships and a shift towards automated integration and analysis of retrieved 
documents and data. Researchers are developing systems that can assimilate, struc- 
ture and present large amounts of mixed-media, multi-dimensional data and informa- 
tion as interactive semantic graphs or hypermedia presentations - greatly enhancing 
the capacity of domain analysts to process information and mine new knowledge. 

In addition, the recent proliferation of eScience communities has led to a demand 
for more sophisticated search interfaces which assist users to interpret experimental 
data in order to prove or disprove a particular scientific theory or hypothesis. We
believe that eScience will drive the next generation of search engines - interactive 
‘hypothesis refinement’ interfaces which automatically retrieve, process, assimilate 
and present relevant information in such a way that the user can see whether there is 
sufficient evidence to corroborate their hypothesis or if it needs refinement. Assum- 
ing the hypothesis refinement process does produce promising results, the next re- 
quirement is to be able to capture and store the hypothesis and its associated prove- 
nance data and body of evidence. This will enable future collaborative sharing, dis- 
cussion and defense of new theories and help prevent duplication of analytical or 
experimental activities. 

The research that we describe in this paper focuses on the design, prototyping
and evaluation of a data exploration and hypothesis-driven search interface called
FUSION, which supports indexing and querying of complex semantic relationships and
is driven by notions of information trust and provenance and by the interactive
investigation, development and capture of hypotheses. Although we have developed FUSION
for a particular eScience application, the optimization of fuel cells by fuel cell
experts, the research described here is applicable to any domain (e.g., science,
engineering, homeland security, social sciences and health) that attempts to
solve complex problems through the analysis and assimilation of large-scale, mixed
information and data sets.

The remainder of this paper is structured as follows. The next section describes 
related work, the background and objectives. Section 3 describes the architectural 
design of the system and the motivation for design decisions that were made. Sections 
4 and 5 describe the interactive data exploration and hypothesis generation interfaces 
respectively. Section 6 describes the results of evaluating the system on real fuel cell 
data and images. Section 7 contains concluding remarks and plans for future work. 

2 Related Work and Objectives 

Hypothesis formulation involves finding local interrelations (hypotheses) among 
attributes within large databases of high dimensionality [3]. Since finding all possible 
interrelations is an infeasible task for many such databases, current research is con- 
cerned with the problem of finding potentially promising hypotheses, which can be 
further verified. This problem is tackled by a number of technologies including statis- 
tical data mining [4], data visualization, clustering and image processing of visualized 
data. In this paper we focus on a novel interactive visualization approach. 

Visualization of large data sets is not new - a large amount of research has been 
undertaken on the application of visualization to data mining and knowledge discov- 
ery. Visualization can provide a qualitative overview of large and complex datasets, 
summarize data, find patterns, correlations, clusters or exceptions in data sets and 
greatly assist with exploratory data analysis. A comprehensive overview of data visu- 
alization techniques can be found in [5]. These approaches mainly apply to purely 
numerical data, do not support heterogeneous data and mixed-media objects (e.g., 
images, audio, video, text) and do not employ Semantic Web technologies to infer or
visualize semantic associations. 

Systems such as Flamenco [6], Topia [7] and CS AKTive Space [8] are examples of
early efforts at blending specific information exploration goals with well-associated
contextual information. Information from multiple heterogeneous sources is
combined and presented to provide an integrated view of a multi-dimensional information
space. Polyarchy visualizations [9] and mSpaces [10] are two formalisms recently
employed to visualize semantic relationships between multiple information objects.
Other examples include work by researchers at DSTC [11] and CWI [12] who have
been working on automatic generation of multimedia presentations based on the
semantic relationships between mixed-media information objects. The common
objective of all of these systems is to provide interactive browsers which present
semantically-associated information visually. The methods for visualizing relationships
between database attributes or information objects vary depending on the nature (i.e.,
types, formats, size, granularity, dimensionality and subject) of the data and
information objects. These information objects may be spatial, temporal, spectral, visual,
audio, textual, 3D, numerical, arrays, matrices, web pages or scholarly publications.
Graphs (2D and 3D), animations, virtual reality, hypermedia, map interfaces and
combinations of these have all been employed to visually represent knowledge bases
or information structures.

The objectives of our work are to enable scientists to solve particular scientific or 
engineering problems by presenting the relevant data in an integrated, synchronized 
and coherent way that facilitates the discovery of new relationships or patterns that 
would not be possible through traditional search interfaces. More specifically we 
wanted to develop a system that combines visualization techniques with semantic 
inferencing and applies them to both multimedia information and multi-dimensional 
data. Consequently our objectives were to: 

• Provide a search, browse and data exploration interface which allows users to in- 
teractively formulate hypotheses. 

• Enable users to define their own mappings from semantic relationships between 
objects and data to preferred spatio-temporal presentation modes. 

• Provide a hypothesis testing interface which allows users to quickly and easily
specify their hypotheses, see whether there is any evidence to support them
and modify or refine them based on the visual/graphical feedback.

• Determine standardized methods for defining, recording and exchanging hypothe- 
ses (e.g., RuleML) and their associated, corroborative, evidential and provenance 
data which have been aggregated within a multimedia object (e.g., SMIL + 3D). 

• Enable storage, search and retrieval of past hypotheses. This captures domain
expert knowledge, enables its re-use and refinement, reduces duplication and provides
evidence and provenance for experimental results.

• Enable annotations of stored presentations - particularly those that reveal new or 
interesting trends. Semantic annotations (based on domain-specific ontologies) of 
presentations enable their retrieval and re-use for further knowledge mining. 

• Test and evaluate the system within the context of a particular eScience applica- 
tion. In our case, we have chosen ‘fuel cell optimization’ because it is a typical sci- 
entific problem involving a large number of variables and data types. 

2.1 eScience Example Scenario

Fuel cells offer an alternative, clean, reliable source of energy for residential use, 
transport and remote communities. Their efficiency is dependent on the internal struc- 
ture of the fuel cell layers and the interfaces between them. Electron microscopy 
generates images of cross-sectional samples through fuel-cell components that reveal 
complex multi-level information. Simple macro-level information such as the thick- 
ness of the cell layers, surface area, roughness and densities can be used to determine 
gas permeation of the electrode materials. Nano-level information about the elec- 
trode's internal interface structure provides data on the efficiency of exchange reac- 
tions. Figure 1 illustrates the range of image data obtainable. 




Fig. 1. Microscopic images of a fuel cell at 3 different magnifications



By digitising the images and applying image processing techniques (MATLAB) to 
them, the amount of information expands even further to levels where human proc- 
essing is not possible and more sophisticated means of data mining are required. In 
addition to the microstructural information revealed by the images, there are the 
manufacturing conditions and processing parameters used to produce the cell con- 
figurations. Finally, for each cell configuration, performance data is available and the 
crux of the project is to marry the microstructural data with manufacturing and per- 
formance data to reveal trends or relationships which could lead to improvements in 
fuel cell design and efficiency. Table 1 shows the range of parameters we are dealing 
with, in addition to the fuel cell images captured at different magnifications. 



Table 1. Fuel Cell Parameters

Fuel cell characteristics   Performance          Manufacturing
Layer thickness             Strength             Wt% Y2O3-ZrO2
Composition                 Density              Wt% Al2O3
Density                     Conductivity graph   Wt% Solvent
Particle Size and Shape     Efficiency           Solid Content
Nearest neighbors           Lifetime             Viscosity
Surface area                                     Tape Speed and Thickness
Porosity                                         Drying Temperature and Time
Surface roughness                                Cost



The aim of the work described here was to build and test an interactive interface 
which will enable fuel cell experts to quickly and easily explore the fuel cell images 
and data in order to determine associations or patterns between parameters, formulate
and validate hypotheses, and save hypotheses and associated corroboratory evidence 
to share with others and keep as a historical record that tracks past investigations. 



3 System Architecture 

Figure 2 illustrates the overall architecture and major components of the FUSION 
system. 




Users access multiple distributed repositories through a Web browser (Microsoft 
Internet Explorer) and an ODBC interface - no specialized software is required on 
the client side except IE 5.5 or higher and an SVG plug-in. The Microsoft implemen- 
tation of SMIL, HTML+TIME [13], is used to build the multimedia presentations. It 
allows spatio-temporal relationships between information objects as well as visual 
effects (such as fading between images) to be implemented. Dependencies between 
values are represented graphically using SVG [14]. Presentations and graphs are
dynamically generated using Python scripts. The HTML forms and pull-down menus
presented to the user are generated from domain-specific OWL [15] ontologies, which are
described in earlier work [16] and specified during system configuration.
The architecture is extremely flexible and can quickly and easily be adapted to any 
domain by connecting to different backend ontologies and knowledge repositories. 

The two main system components, which are described in the next two sections, are:

1. Data Exploration;

2. Hypothesis Formulation. 

3.1 Data Exploration 

The data exploration process consists of four stages, as shown in Figure 3. 




Fig. 3. Four stages of the data exploration process 



Initially, the user chooses the aspects of the fuel cell data which they are interested in
by selecting the parameters from the data set to be viewed, for example porosity,
efficiency and cost. The Query Interpreter (Figure 2) transforms the user’s selection
from the HTML form into a format that the Data Interface can process. The Data
Interface refers to the metadata structure (harmonized ontologies) to determine which
knowledge repositories should be queried. In this example, the performance and
manufacturing data repositories are queried. The retrieved results include the unique
IDs for the fuel cells matching the query. A request is sent to the image database to
retrieve images of the fuel cells matching these IDs. The results of the search are
submitted to the HTML interface for the execution of stages 2 and 3.
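
The first stage can be sketched in Python as a lookup from selected parameters to repositories, followed by a lookup of images for the fuel cell IDs returned. This is a simplified illustration, not the FUSION implementation: the dictionary-based metadata structure, the function names, the sample data and the decision to keep only IDs found in every queried repository are all assumptions made for this example.

```python
def repositories_for(parameters, parameter_repository):
    """Return the set of repositories that hold the selected parameters."""
    return {parameter_repository[name] for name in parameters}


def matching_ids(repositories, repository_ids):
    """Return fuel cell IDs present in every queried repository (an assumed rule)."""
    id_sets = [set(repository_ids[name]) for name in repositories]
    return sorted(set.intersection(*id_sets)) if id_sets else []


def images_for(ids, image_database):
    """Look up the image reference for each matching fuel cell ID."""
    return {cell_id: image_database[cell_id] for cell_id in ids if cell_id in image_database}


# Hypothetical metadata structure and repository contents.
parameter_repository = {"efficiency": "performance", "cost": "manufacturing"}
repository_ids = {"performance": [1, 2, 3], "manufacturing": [2, 3, 4]}
image_database = {2: "cell2.png", 3: "cell3.png", 4: "cell4.png"}

repos = repositories_for(["efficiency", "cost"], parameter_repository)
ids = matching_ids(repos, repository_ids)
print(ids, images_for(ids, image_database))
```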

The HTML interface allows users to specify their preferences for displaying the
retrieved results. Users can specify the following display preferences:

• an ordering parameter for structuring and presenting the results; 

• selection of any additional parameters to be displayed; 

• the type of presentation mode required (time-based or static); 

• preferred data presentation formats (values displayed graphically, or in a list); 

• any special effects to be applied to the presentation (e.g., fading etc.). 

Figure 4 shows the user interface in which the user has specified that they wish to 
“order retrieved fuel cell data by increasing efficiency” and “also display values for 
porosity and cost”. These additional specifications are submitted to the Query Inter- 
preter, which reformulates the request to the Data Interface (“for previously retrieved 
fuel cell IDs retrieve corresponding values for efficiency, porosity and cost”). The 
results of the reformulated query are processed by the Data Interpreter, which trans- 
forms them into the necessary format for the Presentation Generator. The Presenta- 
tion Generator makes decisions about the spatio-temporal layout of the result sets 
based on the users’ preferred presentation mode, format and any special effects. 

Figure 5 shows the user interface for specifying presentation preferences. There 
are three possible presentation modes to choose from: a time-based mode (slide- 
show) or one of two possible static modes - interactive and thumbnail views. When a 
large number of images are retrieved as a result of a search, the slide-show mode will 
display the images automatically without the need to click the next button. The
default speed for a slide-show is 2 seconds per screen, but this can be adjusted.
Alternatively, a static mode (interactive or thumbnail tiled view) can be selected to allow
viewing of the results without any time restrictions. The interactive mode requires the 
users to press the next button to move to the next fuel cell image. 
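
The ordering and timing behaviour of the slide-show mode can be sketched as follows. This is a minimal Python illustration under stated assumptions: the record layout, the function names and the sample efficiency values are hypothetical, and only the default of 2 seconds per screen is taken from the description above.

```python
def order_results(records, ordering_parameter):
    """Sort retrieved fuel cell records by the user's chosen ordering parameter."""
    return sorted(records, key=lambda record: record[ordering_parameter])


def slideshow_schedule(records, seconds_per_screen=2.0):
    """Return (fuel cell ID, begin time in seconds) pairs for a time-based presentation."""
    if seconds_per_screen <= 0:
        raise ValueError("seconds_per_screen must be positive")
    return [(record["id"], index * seconds_per_screen)
            for index, record in enumerate(records)]


# Made-up records for illustration only.
cells = [
    {"id": "cell-b", "efficiency": 0.52},
    {"id": "cell-a", "efficiency": 0.41},
    {"id": "cell-c", "efficiency": 0.63},
]
ordered = order_results(cells, "efficiency")
print(slideshow_schedule(ordered))
```

In the interactive and thumbnail modes the same ordering would apply, but no begin times are needed because the user controls the pace.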



[Screenshot: FUSION data exploration form, with panels for selecting a parameter for ordering images and additional parameters to be displayed, grouped under Fuel cell characteristics, Performance and Manufacturing]

Fig. 4. Data Organization

[Screenshot: FUSION presentation settings form, offering time-based (slide-show) and non-time-based (interactive, thumbnail view) modes and additional settings]

Fig. 5. Presentation Settings



The slide-show mode also displays an animated graph together with the images and 
any additional parameter values chosen by the user. The SVG graph plots one or 
more parameter values against time, and is generated dynamically in synchronization 
with the fuel cell images. This enables users to relate visual features in the images to 
manufacturing and/or performance data. In addition, fading effects can be applied to 
the images in the slide-show. This is helpful for distinguishing differences across 
sequential images. 
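
A graph of this kind can be generated by mapping each parameter value to a point whose horizontal position corresponds to the time at which the matching image is shown. The Python sketch below produces a static SVG polyline under that assumption; the function name, canvas size and scaling are illustrative choices, and the animation that reveals the graph in step with the images is omitted.

```python
def svg_time_graph(values, seconds_per_screen=2.0, width=300, height=100):
    """Return an SVG document plotting parameter values against presentation time."""
    if not values:
        raise ValueError("at least one value is required")
    lowest, highest = min(values), max(values)
    value_span = (highest - lowest) or 1.0  # avoid division by zero for flat data
    total_time = max(len(values) - 1, 1) * seconds_per_screen
    points = []
    for index, value in enumerate(values):
        x = (index * seconds_per_screen) / total_time * width
        y = height - (value - lowest) / value_span * height  # SVG y axis points down
        points.append(f"{x:.1f},{y:.1f}")
    point_list = " ".join(points)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f'<polyline fill="none" stroke="black" points="{point_list}"/>'
        "</svg>"
    )


print(svg_time_graph([0.0, 1.0, 2.0]))
```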

The thumbnail presentation mode lays out thumbnail images for all of the re- 
trieved fuel cells in a tiled structure ordered by the chosen parameter. In addition any 
requested parameter values are displayed below each corresponding thumbnail. Users 
can click on a thumbnail image to view it in full-size with all requested parameters 
and values listed below it. 

Figure 6 illustrates the results of a slide-show presentation - the final stage of the 
data exploration process. The data exploration interface has been designed to enable a 
user to interactively explore large mixed-media, multidimensional data and informa- 
tion sets. By enabling users to choose the presentation style which best suits them, 
and to focus on the range and scope of data sets that most interest them, the system 
maximizes the potential and speed at which domain experts can discover new inter- 
dependencies or trends within the data or develop hypotheses. If the user finds an 
interesting pattern or association they can save the HTML+TIME+SVG presentation, 
together with the associated metadata (Unique ID, Date/Time, Creator, Settings, Ob- 
jective) or move on to the next stage, the hypothesis testing interface, which is de- 
scribed in the next section. 

All screenshots in this paper are also available at [17]. 

Fig. 6. Screenshot of a slide-show presentation with animated graphs 



3.2 Hypothesis Testing 

The design of the data exploration interface was based on a user-needs analysis and 
user feedback, as well as certain assumptions regarding the usefulness of particular 
features and presentation modes for displaying large amounts of heterogeneous data 
and information. The design of the hypothesis testing interface, however, was based 
on an analysis of the cognitive process of hypothesis testing and scientific discovery. 
Hypothesis formulation does not occur spontaneously but is an interactive, evolution- 
ary process which grows out of background experience and assumptions which lead 
to ideas about relationships within the data, which the scientist wants to verify. Re- 
search into the process of conducting tests and experiments has shown that the hy- 
pothesis formulation workflow depends on whether results are expected or unex- 
pected [18, 19]. If results conform to a particular hypothesis, then the work continues 
forward with the verification of further hypotheses. If an unexpected result occurs, 
further testing is done in order to explain the result. An unexpected result may occur 
due to an erroneous primary assumption or methodological errors. Otherwise, unex- 
pected results may lead to a new discovery. Taking this into account, we developed 
an interface that enables users to specify their hypotheses, define prerequisites for 
validating and testing hypotheses and attach explanations to the results which are 
obtained. The workflow for the hypothesis testing process is shown in Figure 7. 

Fig. 7. Four stages of the hypothesis testing process 



Consider the following example. The user wants to test the hypothesis: “IF substrate 
width is greater than 12 μm AND density value lies between 5 and 10 particles/μm² 
THEN efficiency is greater than or equal to 80%”. An HTML interface generated from 
the back-end ontologies enables the user to specify such a hypothesis. Figure 8 illus- 
trates the user interface which supports the first and the second phases in Figure 7. 
Figure 8 consists of the following sections: 

• Descriptive Metadata - for each hypothesis, the system generates the following 
metadata: Unique ID, Date/time and the Creator/Researcher ID. 

• A free text field that enables researchers to record the background motivation for 
this particular hypothesis. 

• A searchable list of all previously conducted experiments. Documenting past work 
enables sharing of earlier hypotheses, the layering of new hypotheses and helps re- 
veal conflicts between hypotheses and prevent duplication of research. 

• The bottom left section provides the interface for entering a hypothesis. It is di- 
vided into two parts: the conditional part of a hypothesis (if statement) and the con- 
sequence part of a hypothesis (then statement). Multiple sub-statements can be 
combined within the if or then statements using logical connectors (AND/OR). The 
bottom right part of the window contains dynamically filled lists of then statements 
that match the specified if statement on the left hand side. This mechanism helps to 
indicate dynamically whether a hypothesis with the same if statement has previously 
been tested and what the outcome of this investigation was. If a matching 
previously-tested hypothesis is found, then the user can click on the then statement 
and retrieve a complete record of the results of that investigation. 




Fig. 8. Hypothesis specification 



Fig. 9. Results of Hypothesis testing 

The process of translating a hypothesis into a set of queries and retrieving the relevant 
data is identical to the process described in Section 3.1. The values/ranges that 
are retrieved for the specified parameters are sent to the Hypothesis Testing component, 
which attempts to verify or refute the hypothesis. The verification results, together 
with the hypothesis itself, are passed through the Data Interpreter and Presentation 
Generator to produce a presentation. Dependencies between specified parameters 
are displayed within dynamically generated SVG graph(s). The hypothesis itself 
is transformed into RuleML [20] format, which users can choose to save to a Xindice 
[21] repository of stored hypotheses. 
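
The verification step can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the paper does not describe the internals of the Hypothesis Testing component, and the field names (`substrate_width_um`, `density`, `efficiency`), the sample records and the function `evaluate_hypothesis` are hypothetical. The sketch applies the example hypothesis from the start of this section to a list of records and counts those that support or contradict it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, float]


@dataclass
class Hypothesis:
    """An if-then hypothesis over parameter values (illustrative only)."""
    condition: Callable[[Record], bool]    # the "if" part
    consequence: Callable[[Record], bool]  # the "then" part


def evaluate_hypothesis(h: Hypothesis, records: List[Record]) -> dict:
    """Count the records that support or contradict the hypothesis."""
    applicable = [r for r in records if h.condition(r)]
    supporting = [r for r in applicable if h.consequence(r)]
    # Assumption: the outcome is positive only if at least one record
    # satisfies the condition and none of those records contradicts it.
    positive = bool(applicable) and len(supporting) == len(applicable)
    return {
        "applicable": len(applicable),
        "supporting": len(supporting),
        "contradicting": len(applicable) - len(supporting),
        "outcome": "positive" if positive else "negative",
    }


# IF width > 12 AND 5 <= density <= 10 THEN efficiency >= 80
example = Hypothesis(
    condition=lambda r: r["substrate_width_um"] > 12 and 5 <= r["density"] <= 10,
    consequence=lambda r: r["efficiency"] >= 80,
)

# Made-up records, not data from the paper.
records = [
    {"substrate_width_um": 14.0, "density": 6.0, "efficiency": 85.0},
    {"substrate_width_um": 13.0, "density": 9.0, "efficiency": 78.0},
    {"substrate_width_um": 10.0, "density": 7.0, "efficiency": 90.0},
]

result = evaluate_hypothesis(example, records)
```

In this sketch the third record does not satisfy the if part and is therefore ignored, so it counts neither for nor against the hypothesis.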

Figure 9 illustrates a presentation that was generated following the specification 
and submission of a hypothesis to the knowledge repository. The hypothesis statement 
is displayed at the top of the screen. The graphs displayed beneath the 
hypothesis statement depict the dependencies between parameters specified in the 
hypothesis and provide feedback to the user on whether or not there is any evidence 
to support their hypothesis. For the example hypothesis given above, two graphs are 
generated. One plots density against substrate width. The other plots efficiency 
against substrate width. Users are able to either save this hypothesis (with an explanatory 
annotation) or go back, make changes to the original hypothesis and resubmit 
it to the knowledge base. A complete record of the saved experiment/hypothesis 
consists of: 

• The metadata for the hypothesis: unique ID, date, and author; 

• The motivation or background for the hypothesis; 

• The hypothesis itself in the form of an if-then statement; 

• The results of applying the hypothesis to the knowledge repository - an 
HTML+TIME+SVG presentation; 

• An outcome attribute specifying whether the results were positive or negative; 

• An annotation field, entered by the user, which contains a possible explanation for 
the results that were obtained. 

An XML metadata record and the RuleML representation for the example hypothesis 
given at the start of this section can be found at [17]. 



4 Evaluation 

User testing of the system has been carried out by fuel cell scientists from The Uni- 
versity of Queensland’s Centre for Microscopy and Microanalysis. Feedback from 
the users to date has indicated the following: 

• The user interface design and incorporation of domain-specific ontologies [16] 
allowed users with little knowledge of the domain to quickly and easily explore 
the data and gain an understanding of the knowledge space; 

• Different users carry out research activities differently. Being able to customize or 
personalize the mode, scope and focus of the assimilated data presentations and the 
hypothesis refinement process was very beneficial for individual productivity; 

• The different presentation modes enabled faster processing and interpretation of 
large data sets and images by the fuel cell scientists than was possible manually 
and expedited the hypothesis generation and refinement process; 

• Slide shows of images synchronized with animated graphs that plot corresponding 
requested parameter values were the most popular method of data exploration and 
hypothesis formulation; 

• The fading effect was useful for detecting subtle image differences; 

• Static presentations and graphs can be incorporated directly into scholarly publica- 
tions, reducing the time required to disseminate research results; 

• Being able to record, browse and retrieve past investigations and hypotheses 
reduced duplication and enabled existing hypotheses to be refined or new hypotheses 
to be developed based on past work. It also provided a way of capturing and sharing 
tacit domain expert knowledge, explicitly, in the form of rules; 

• Existing automatic hypothesis testing techniques (e.g., statistical analysis) only 
work on quantitative data. A major advantage of the FUSION system’s approach 
is that it is applicable across a range of data and media types; 

• The saving of evidential and provenance data with hypotheses enables the validity 
of earlier hypotheses or assumptions to be assessed by other scientists - who are 
able to attach their own opinions in the form of annotations; 

• The use of semantic web technologies such as ontologies, annotations and inferencing 
rules provides a consistent, machine-processable way of describing, capturing, 
re-using and building on the domain knowledge. It also enables better collaboration 
between distributed research laboratories and industry through improved 
sharing of knowledge and data. 

An on-line demonstration of the prototype system is available at [22]. Users need to 
be using IE 5.5 or higher and an SVG plug-in, such as Adobe’s plugin [23]. 

5 Conclusions and Future Work 

In this paper we describe a search interface that enables scientists to interact with a 
knowledge base through a hypothesis-driven approach that combines data explora- 
tion, integration, search and inferencing - enabling more complex analysis and 
deeper insight. We believe that such interfaces represent the next generation of search 
engines and that they will be increasingly in demand and applied across many 
domains including science, engineering, homeland security, social sciences and 
health, to solve complex problems and provide decision support tools based on the 
analysis and assimilation of large-scale, mixed-media, multi-dimensional information 
and data sets. 

Plans for future work include: 

• Further testing and refinement of the system, particularly within a real-world 
industrial environment. We plan to deploy it within a fuel-cell manufacturing com- 
pany to facilitate the exchange of knowledge between university research and in- 
dustry organizations in this domain; 

• Integrating statistical data analysis methods and applying them to hypotheses 
formulated through our system, to fit more precise mathematical models to relationships 
between parameters; 

• Investigating how the empirical modeling approach described here can be com- 
bined with the physical modeling approach to generate a more accurate predictive 
model for simulating fuel cell behaviour. 

• Testing the portability, flexibility and scalability of the system by applying it to 
other domains, such as environmental modeling and bioinformatics. 

Looking even further into the future, we envisage that instead of users interactively 
submitting hypotheses to such a system, there will be pro-active systems which are: 
constantly and dynamically assimilating new information; using existing, stored hypotheses 
to automatically detect anomalies, problems, or exceptional events; inferring new 
hypotheses and knowledge; and notifying users by returning actionable information. 
But we still have a long way to go before such intelligent or sophisticated systems 
become widely available. 

Abstract. A virtual hyperbook is a virtual document made of a set of 
information fragments linked to a domain ontology and equipped with selection 
and assembly methods or rules. In this paper, we study the problem 
of accessing and reading in a digital library of virtual hyperbooks. In 
this case it is necessary to generate hyperdocuments that present information 
and knowledge originating from several hyperbooks. Moreover, 
these hyperdocuments must fit the reading objectives or specific 
points of view of readers. Our approach is based on the integration of 
domain ontologies and the re-use of interface specifications. 



1 Introduction 

A virtual document is a set of information fragments associated with filtering, 
organisation and assembling mechanisms. Depending on a user profile or user 
intentions, these mechanisms will produce different documents adapted to the 
user’s needs. The idea of a virtual document has emerged from research on ‘pre-Web’ 
hypertext systems, such as MacWeb [20] and, more recently, on adaptive and 
personalized hypertext systems. Given the rapid development of theoretical and 
practical tools in this domain, it is reasonable to think that digital libraries will 
incorporate virtual documents in addition to traditional electronic documents. 

It is thus interesting to explore the new accessing and reading possibilities 
that a library of virtual documents can provide. The main distinction between 
a traditional digital library and a virtual document library is the disappearance 
of the monolithic character of a book or an article. The ability to select and 
assemble informational fragments coming from various virtual documents opens 
new perspectives on the reading action, but it also raises important questions. In 
a digital library of virtual documents, a document reading system should be able 
to compose new documents from all the available informational fragments of the 
library, according to the readers’ objectives. For instance, a reader wishing to 
obtain some information about the concept of recursion should get a document 
containing a definition of this term, possibly other definitions that represent 
alternative points of view, examples and exercises drawn from various virtual 
documents, historical notes, etc. We can also consider that a virtual book, once 
inserted into a library, will automatically enrich itself by connecting to fragments 
of other books (new examples and exercises, new comments about several 
concepts, etc.). 

In these two cases it is obviously necessary to check the semantic compati- 
bility of the fragments before re-using them. The objective is to deliver to the 
reader new documents that are semantically coherent. For this, we propose an 
approach based on ontology integration and on reusability of virtual document 
interface specifications. 

1.1 Hyperbooks and Virtual Documents 

For several years, the concept of virtual document has been studied in different 
contexts and from different perspectives. Research on hypertexts has tackled 
several problem areas that are related to our study. Systems like Intermedia [14] 
or Storyspace [2] were developed essentially for producing hypertext literature; 
others, like KMS [1] or MacWeb [20], have aimed at the management and sharing 
of knowledge. Concepts like links, anchors, composition of nodes, etc. were 
studied in detail. This led, among other results, to the definition of the Dexter 
reference model [16]. Various models and systems have also been proposed for 
the integration of books and electronic documents into hyperbooks. This concerns 
the transformation of paper books into hypertext [22] or into electronic 
books [19], writing directly in electronic form (hypertext) [11], or integrating 
existing electronic documents [3]. The hypertext personalization problem led to 
the definition of models and techniques for adaptation and adaptivity [7], [4]. 
The capacity of adaptation corresponds to the presentation of different or differently 
organized contents, depending on a user profile. Adaptivity consists in 
automatically updating the user profile according to his or her behaviour. A well-known 
example of adaptability is the change in color of links leading 
to already visited web pages. In [25], the authors propose a model of adaptive 
hypertext which includes a domain model, a user model and adaptation rules. 
The domain model is a semantic network consisting of domain concepts and relations 
between concepts. This model serves essentially to define adaptation rules, 
depending, for instance, on the concepts known or appropriated by the user. 
More recently, a research field has emerged that concentrates on the concept 
of personalizable virtual documents [6], [13]. Personalizable virtual documents 
are defined as sets of elements (often called fragments) associated with filtering, 
organization and assembling mechanisms. According to a user profile or user 
intentions, these mechanisms will produce different documents adapted to the 
user’s needs. For instance, in [5] Crampes and Ranwez define pedagogical virtual 
documents. Garlatti and Iksal [17] proposed a comprehensive and detailed model 
of virtual documents based on several ontologies. 

There is presently no consensus on a common virtual document model. Nev- 
ertheless, most of the proposed models are comprised of (at least) a domain 
ontology and a fragment base. These models generally differ in the user interface 
part, i.e. how to specify the production of user-readable documents. Existing 
models use declarative languages, pedagogical or narrative ontologies, inference 
rules, or other mechanisms. 

1.2 Ontology Integration 

The integration and re-usability of ontologies play a major role in the domain 
of virtual books and a fortiori in the domain of virtual libraries. If we suppose 
that each virtual book has its own domain ontology, we need an integration 
technique to create a semantically coherent virtual library. It is important to 
note here that it is not realistic to suppose that all the virtual books will refer 
to the same (global) ontology, because either such an ontology does not currently 
exist or, even if it existed, it would contain only stable and well-established 
concepts (thus it would not be convenient for books on new and advanced topics). 

The literature about ontology integration is indeed very heterogeneous. As 
a starting point, we can refer to [21] and [18] for drawing a typology of the 
principal methods of integration. There are two major approaches to ontology 
integration, namely, alignment and fusion. 

Alignment techniques try to bring two ontologies into mutual agreement by 
establishing correspondence links between the concepts of the two ontologies [18]. 
As a subcategory, mapping techniques intend to relate corresponding concepts 
or relations by an equivalence relation. In both cases, the existing ontologies will 
persist. This integration process is often chosen if the ontologies cover comple- 
mentary domains. 

The fusion of ontologies consists in creating a new coherent ontology by 
merging or matching concepts. This process is often quite complex because it may 
require, among other things, the creation of new concepts in order to relate concepts 
from the two ontologies. It is thus very difficult to automate. Nevertheless, there 
exist environments and tools, like Chimaera [15], that help in merging and diagnosing 
multiple ontologies. 

In our particular case of digital libraries, we should take into account the fact 
that such a library can evolve continually. The arrival of new documents will 
require a constant process of integration so that the ontology remains adapted 
to the digital library. Thus, the fusion approach is probably not adequate for 
integrating hyperbooks in a digital library. Hence, the approach we propose is 
based on alignment and mapping. 

In the rest of this paper, we first propose, in section 2, a simple model for 
virtual documents (virtual hyperbooks). Next, we describe in section 3 our 
multi-point of view approach to ontologies and the integration process of hyperbooks 
into digital libraries. In section 4, we will show how to use the model to define 
documents for ‘global’ reading in an integrated library. The conclusion briefly 
presents the implementation techniques for realizing prototypes of digital libraries. 



2 Virtual Hyperbook Model 

The hyperbook model we use is comprised of a fragment repository, a domain 
ontology, and an interface specification, as shown in Fig. 1. The fragments and 
the ontology, together with their interconnecting links, form the structural part 
of the hyperbook. The interface specification specifies how to assemble the information 
fragments, with the help of the domain ontology, to produce a hypertext 
that constitutes the hyperbook’s user interface. 

Fig. 1. Components of the virtual hyperbook model and the reading/writing interface. 

2.1 Structural Part of a Hyperbook 

The hyperbook structure is shown in Fig. 2 as a set of classes and associations 
(expressed in UML). The structural part of a particular hyperbook is a set of 
objects that are instances of these classes. Classes OF_Link and FF_Link are 
associative classes that represent the links between the domain ontology and 
the fragments and between fragments. Links between fragments can have different 
natures, such as: structural links (from fragments to sub-fragments); argumentative 
links (arguments, positions, contradictions, ...); narrative or rhetorical 
links (elaboration, summary, reinforcement, ...). The domain ontology is a set 
of concepts connected through semantic relations that have a type and possibly 
restrictions (such as number restrictions). Every concept is connected to one or 
more terms. 




Fig. 2. Classes of the hyperbook structure. 



The domain ontology plays two roles. On the one hand, it describes the concepts of 
the domain. On the other hand, it serves as a reference to describe the information 
content of the fragments. By establishing typed links from fragments to concepts, 
one can qualify not only what the fragment is about but also what relationship 
it has with the domain concepts. Typical link types are: 

— instance, example, illustration: the fragment describes a particular instance 
of the referred concept 

— definition: the fragment contains a textual (or audio, or graphical) definition 
of the concept 

— property: the fragment describes a property of the concept 

— reference, use, required: the fragment refers to the concept (it is necessary 
to know the concept to understand the fragment) 

2.2 Interface Specification 

The interface of a virtual hyperbook is a (real) hypertext, made of nodes and 
links, derived from the informational structure according to an interface speci- 
fication. An interface specification is a set of node schemas, the instantiation of 
which produces the real nodes (XML documents) and links of the interface. A 
node schema is an expression of the form 

node name [ parameters ] 
    content-and-link-specification 
    from selection-expression 

A selection expression is a path expression with attribute conditions. A path 
expression is a sequence $E_1, \ldots, E_n$ where each $E_i$ is a path element of the 
form class-name variable [condition] or of the form -(association-name variable 
[condition])->. The evaluation of such an expression yields an n-tuple of 
interconnected objects that belong to the classes of the hyperbook structure 
(fragments, concepts, ontology-fragment links, etc.) and that satisfy conditions 
on their attributes. For instance, the expression 

Concept c -(OF_Link k [type="example"])-> Fragment f 

specifies the set of triples (c, k, f) such that c is a Concept, k is an OF_Link with 
k.type = example, f is a Fragment, and k.from = c and k.to = f. In other words, 
it selects all the concepts and fragments that are connected through an ontology-fragment 
link of type example. The path expressions can be abbreviated by 
omitting the associative class names when there is no ambiguity. Moreover, when 
the condition has the form type = type-name, we will simply write type-name. 
Hence, the above path expression will be written as 

Concept c -"example"-> Fragment f 
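
To make the evaluation of such a path expression concrete, here is a minimal Python sketch. It is not the authors' implementation: the classes, the sample data and the function `select_triples` are hypothetical, and the link ends are named `source` and `target` because `from` is a reserved word in Python. The sketch evaluates the expression that selects concepts and fragments connected through an ontology-fragment link of a given type, and yields one triple per match.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class Concept:
    term: str


@dataclass(frozen=True)
class Fragment:
    title: str
    content: str


@dataclass(frozen=True)
class OFLink:
    """Ontology-fragment link; source/target stand for k.from and k.to."""
    type: str
    source: Concept
    target: Fragment


def select_triples(links: List[OFLink],
                   link_type: str) -> Iterator[Tuple[Concept, OFLink, Fragment]]:
    """Evaluate: Concept c -(OF_Link k [type=link_type])-> Fragment f."""
    for k in links:
        if k.type == link_type:
            yield (k.source, k, k.target)


# Made-up sample structure.
recursion = Concept("recursion")
factorial = Fragment("Factorial", "n! = n * (n-1)!")
definition = Fragment("Definition", "A function that calls itself.")
links = [
    OFLink("example", recursion, factorial),
    OFLink("definition", recursion, definition),
]

triples = list(select_triples(links, "example"))
```

Only the link of type example is selected here, so `triples` holds a single (concept, link, fragment) triple.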

The content specification of a node is a list of XML elements that may con- 
tain string constants, object attributes, or expressions with string or arithmetic 
operators. 

Example 1. The following node schema selects all the fragments that are linked 
to a given concept C. The content of a node instance is comprised of 

— a title element that contains the main term associated to concept C 

— for each fragment F connected to C through a link L: the link type and the 
fragment’s title and content 

node connected_fragments [C] 
    <title> "Fragments connected to ", C.term </title> , 
    { <subtitle> L.type, ": ", F.title </subtitle> 
      <text> F.content </text> 
    } 
    from Concept C -(L)-> Fragment F 

(The content specification between { and } is repeated for each selected n-tuple 
of objects) 

The actual presentation of a node instance of this schema will be determined by 
XSLT or CSS style sheets. 
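
The instantiation of the node schema of Example 1 can be sketched in Python as follows. This is an illustration only: the tuple representation of links, the wrapping `node` element with its `schema` attribute, and the sample data are assumptions, not part of the model described here. The sketch produces one title element and then repeats the subtitle and text elements once per selected link.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Each link is (concept_term, link_type, fragment_title, fragment_content).
Link = Tuple[str, str, str, str]


def connected_fragments(term: str, links: List[Link]) -> ET.Element:
    """Instantiate the connected_fragments schema for the concept `term`."""
    node = ET.Element("node", {"schema": "connected_fragments"})
    ET.SubElement(node, "title").text = "Fragments connected to " + term
    # The part between { and } is repeated once per selected tuple.
    for concept_term, link_type, frag_title, frag_content in links:
        if concept_term != term:
            continue
        ET.SubElement(node, "subtitle").text = f"{link_type}: {frag_title}"
        ET.SubElement(node, "text").text = frag_content
    return node


# Made-up sample links.
links = [
    ("recursion", "definition", "Definition", "A function that calls itself."),
    ("recursion", "example", "Factorial", "n! = n * (n-1)!"),
    ("iteration", "example", "Loop", "Repeat a block a fixed number of times."),
]

node = connected_fragments("recursion", links)
xml_text = ET.tostring(node, encoding="unicode")
```

The resulting XML string could then be rendered by a style sheet, which is outside the scope of this sketch.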

2.3 Ontology and Link Inference 

The links between the ontology and the fragments play a crucial role in establishing 
relevant links between fragments and in generating interface documents. The idea 
is to replace direct linking between fragments (often called horizontal linking) by 
inferred links that correspond to paths starting from a fragment, going through 
one or more ontology concepts, and ending on another fragment. Inferred links 
are preferred to direct links because users (authors) are generally able to establish 
correctly typed links from the fragments they write to the relevant concepts. 
But when they are asked to link their fragments directly to other fragments 
they have difficulties finding relevant fragments to link to and deciding on what 
type of links to establish. Since the ontology has a graph structure, semantically 
meaningful links can be obtained by simple inference rules that consist in path 
expressions. For instance, Fig. 3 shows two derived links (1) and (2) obtained by 
going up into the domain ontology and then down to another fragment. Such paths 
are written in the path expression language presented in Section 2.2. 




Fig. 3. Classes of the hyperbook structure. 



Example 2. The following node schema has a selection expression that corresponds 
to the above-mentioned link inference path (1). An instance of this 
schema will display the content of the fragment Ex and a list of hypertext 
links (href) to nodes (showFragment) that display other examples of the same 
concept C. The titles of the other examples are used as anchor text for the 
hyperlinks (OtherEx.title). 

node exampleAndOthers [Ex] 
    Ex.content , + , 
    "Other examples of concept ", C.term , 
    { 
      href showFragment [OtherEx] ( OtherEx.title ) 
    } 
    from Fragment Ex <-"example"- Concept C 
         -"example"-> Fragment OtherEx 

This same link inference mechanism will be used to generate links across hyperbooks. 
In addition to the standard hyperlinks shown in the previous example, 
the interface specification language also provides inclusion links. Inclusion links 
enable the interface designer to create complex contents that show several fragment 
contents together in a single hypertext node. Had we used inclusion in this 
example, we would have obtained a single node (document) showing an example 
together with all the other examples. 
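
The link inference of Example 2 can be illustrated with a short Python sketch. The triple representation, the fragment identifiers and the function `inferred_links` are hypothetical. The sketch starts from a fragment, goes up to each concept it is linked to with a given type, and comes back down to the other fragments linked to that concept with the same type. Excluding the starting fragment from the result is a choice made in this sketch; the path expression itself does not state it.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Typed ontology-fragment links as (concept, link_type, fragment) triples.
OFLink = Tuple[str, str, str]


def inferred_links(of_links: List[OFLink], start: str,
                   link_type: str) -> Dict[str, List[str]]:
    """Follow: Fragment start <-type- Concept C -type-> Fragment other.

    Returns, for each concept C reached from `start`, the other fragments
    linked to C with the same link type.
    """
    by_concept: Dict[str, List[str]] = defaultdict(list)
    for concept, ltype, fragment in of_links:
        if ltype == link_type:
            by_concept[concept].append(fragment)

    result: Dict[str, List[str]] = {}
    for concept, fragments in by_concept.items():
        if start in fragments:
            # Sketch-level choice: do not link a fragment to itself.
            result[concept] = [f for f in fragments if f != start]
    return result


# Made-up sample links.
of_links = [
    ("recursion", "example", "factorial"),
    ("recursion", "example", "towers_of_hanoi"),
    ("recursion", "definition", "recursion_def"),
    ("iteration", "example", "for_loop"),
]

others = inferred_links(of_links, "factorial", "example")
```

No direct fragment-to-fragment link is stored anywhere in this structure; the link from `factorial` to `towers_of_hanoi` exists only because both are examples of the same concept.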

3 A Multi-point of View Approach 
to Hyperbook Integration 

With the objective of creating interface documents not only for reading virtual 
documents but also for accessing a whole library, we could choose a very direct 
approach that consists of integrating all hyperbooks into one large hyperbook. 
In this case, we must create a global ontology from the available hyperbook 
ontologies. However, this approach is very limited because: 

— It forces a strong unification of the concepts. This is not a problem for 
the well-established terminology of a domain, but it might be 
problematic when concepts have vague environments or when divergent and 
contradictory interpretations remain. 

— It does not reflect the fact that each book represents the point of view of an 
author on a subject. This diversity of points of view would be lost. 

— It loses the diversity of narrative styles (reflected in the interface documents) 
adopted by the different authors. 

This is why we propose an approach based on multi-point of view ontologies [8,9]. 



3.1 Concept Conflicts and Points of View 

Gaines and Shaw [12] propose a methodology to compare the conceptual systems of 
several domain experts. This method is based on analyzing domain entities (concepts), 
the terms used to designate them, and their attributes. The aim is to highlight 
the divergences between experts in order to facilitate the discussion to obtain a 
consensus. This analysis can lead to four different situations: 

Consensus. The experts use the same term to describe the same concept. 
Correspondence. Different terms are used for the same concept. 




106 Gilles Falquet, Claire-Lise Mottaz-Jiang, and Jean-Claude Ziswiler 



Conflict. The same term is used for different concepts. 

Contrast. The experts identified different concepts and use different terms to 
name them. 
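
These four situations form a two-by-two decision table over two questions: do the experts use the same term, and do they mean the same concept? A minimal Python sketch of that table follows; the function name `classify` and its boolean parameters are illustrative and are not taken from the methodology itself.

```python
def classify(same_term: bool, same_concept: bool) -> str:
    """Map the two comparisons onto the four situations listed above."""
    if same_term and same_concept:
        return "consensus"        # same term, same concept
    if same_concept:
        return "correspondence"   # different terms, same concept
    if same_term:
        return "conflict"         # same term, different concepts
    return "contrast"             # different terms, different concepts


# The term 'table' used in two unrelated senses is a conflict.
situation = classify(same_term=True, same_concept=False)
```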

In a situation of conflict the domain experts must work together to reach a consensus, 
i.e. to define a single concept that corresponds to the term in question. In 
a multi-point of view approach, the resolution of conflicts is carried out differently, 
by considering that there can exist several concept definitions associated with 
the same term, provided they belong to different points of view on the domain. 
When integrating hyperbook ontologies we will consider that each hyperbook 
represents a point of view on its domain. Since the hyperbooks may belong to 
completely different domains, we will find the following three situations: 

— Two concepts designated by the same term do not belong to the same semantic 
domain. For example, the concept ‘table’ of the furniture ontology 
and the concept ‘table’ of an ontology about databases. 

— Two concepts effectively belong to the same domain, but they have different 
definitions. The two definitions represent different points of view on this 
concept. 

— The definitions of the two concepts are considered to be equivalent. The 
points of view of the two hyperbooks coincide for this concept. 



3.2 Ontology Integration for Virtual Documents 

Since the objective of the integration process is to lead to a multi-point of view 
ontology and not just to a “monolithic” ontology, the most appropriate integration 
techniques are those which establish links between concepts (mapping 
of ontologies). 

For this, we propose to use an extension of the technique of Rodriguez and 
Egenhofer [23]. There, the similarity between two concepts is the weighted sum 
of three measurements: similarity of the terms (set of synonyms), similarity of 
the attributes (set of values) and similarity of the semantic neighbourhood (set 
of concepts that are close via semantic links in the graph). Moreover, the similarity 
function takes into account the difference in depth of the concepts (relative to 
their respective ontologies). 

In the case of virtual documents, we make use of additional information to 
evaluate the similarity between concepts, namely the fragments related to each 
concept. If two concepts A and B are bound by links of the same type t to sets 
of fragments t(A) and t(B) respectively, the similarity between t(A) and t(B) 
can be taken into account when computing the similarity between A and B. The 
similarity between fragments of t(A) and t(B) can be obtained with well-known 
document similarity measures (for instance, the cosine between the tf-idf vectors 
representing the documents in the space of terms [24], or the Kolmogorov distance). 
We then define the similarity between t(A) and t(B) from these similarities 
between documents (for example, by taking the maximum similarity found between 
all the fragments of t(A) and t(B)). The similarities obtained for all types 
of links are then added to the similarity measure computed at the conceptual 
level. It is important to remark that link typing is crucial here. Indeed, 
the comparison makes sense only if the compared fragments play the same role 
with respect to their concepts. If, for instance, fragment a is an example of concept 
A whereas b is a counter-example of B, a strong similarity between a and b does 
not imply a strong similarity between A and B; on the contrary. 
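
As an illustration of the fragment-based extension, the following sketch computes the maximum tf-idf cosine between the fragments of t(A) and t(B) and adds it to a conceptual similarity, one link type at a time. The function names, the tokenised-list representation of fragments and the smoothed idf formula are assumptions made for this example, not part of the model.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (dicts) for a list of tokenised documents.

    The idf used here, log(n / df) + 1, is one common smoothed variant.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * (math.log(n / df[t]) + 1.0) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fragment_set_similarity(frags_a, frags_b):
    """Maximum pairwise cosine between the fragments of t(A) and t(B)."""
    if not frags_a or not frags_b:
        return 0.0
    vectors = tfidf_vectors(frags_a + frags_b)
    vec_a, vec_b = vectors[:len(frags_a)], vectors[len(frags_a):]
    return max(cosine(x, y) for x in vec_a for y in vec_b)


def extended_similarity(conceptual_sim, links_a, links_b):
    """Add fragment similarity for every link type shared by both concepts.

    links_a and links_b map a link type (e.g. "example") to a list of
    tokenised fragments. Fragments are only compared within the same type.
    """
    total = conceptual_sim
    for link_type in links_a.keys() & links_b.keys():
        total += fragment_set_similarity(links_a[link_type], links_b[link_type])
    return total
```

Because only link types present on both concepts are compared, an example of A is never matched against a counter-example of B.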

4 Generation of Interface Documents for Libraries 

An interesting characteristic of the virtual hyperbook model and of the integra- 
tion model is the possibility of re-using specifications of virtual interface docu- 
ments to create global reading interfaces. 

A first technique for building a global interface consists of re-using the 
specification of a hyperbook interface but applying it to the whole information space 
of the library, i.e. to the fragments and ontologies of all the hyperbooks and 
their interconnections through similarity links. A new, extended version of each 
node schema of the interface is derived by extending its selection expression as 
follows: 

Each element of the form 

Concept c 

is replaced by 

Concept c [hybook = L] - ("sim") -> Concept c’ 

Thus every path through c can now “jump out” to another hyperbook as shown 
in Fig. 4. 




Fig. 4. Selection path to another hyperbook. 



The initial book is thus enriched with other points of view on the subject. 
The following node schema is the extension of the schema shown in Example 2. 

node Extended_exampleAndOthers [Ex, threshold] 
    Ex.content, +, 
    "Other examples of concept ", C.term, 
    { href Fragment [OtherEx, R.value] (OtherEx.title) } 
from Fragment Ex <-"example"- Concept C 
    -(R [type="sim" and value>threshold])-> Concept C2 
    -"example"-> Fragment OtherEx 

By adjusting the threshold value, the user can define the type of extension he or 
she desires. A very high threshold corresponds to an extension with very close 
points of view, while a lower threshold accepts dissimilar points of view. A second 
way to re-use an interface specification consists of applying the interface 
specification of a hyperbook to another one. In this case, we will see the informational 
content of one hyperbook with the interface of another. If we consider that the 
interface of a hyperbook represents its narrative style, we obtain a vision of the 
content of a hyperbook in the style of another one. This kind of re-use does 
not require any rewriting of node schemas, but it implies that the hyperbook 
ontologies use the same types of relations. 
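
The effect of the threshold in the extended node schema can be sketched as a simple graph traversal. This is not the Lazy language: the dictionary and tuple representations of "example" and "sim" links, and the function name, are assumptions made for illustration.

```python
def other_examples(ex, examples_of, sim_links, threshold):
    """Select examples of concepts similar to the concept that ex illustrates.

    examples_of: dict mapping a concept to the list of its example fragments
    sim_links:   list of (concept, other_concept, value) similarity links
    Returns (fragment, similarity value) pairs, in link order.
    """
    # Fragment Ex <-"example"- Concept C
    concepts = [c for c, frags in examples_of.items() if ex in frags]
    found = []
    for c in concepts:
        # -(R [type="sim" and value>threshold])-> Concept C2
        for src, dst, value in sim_links:
            if src == c and value > threshold:
                # -"example"-> Fragment OtherEx
                for frag in examples_of.get(dst, []):
                    found.append((frag, value))
    return found
```

Raising the threshold shrinks the result to examples drawn from closely matching concepts; lowering it admits examples from more distant points of view.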

It is also possible to define a completely new reading interface on the whole 
library. In this case, we suppose that an author wants to create a new book 
starting from information already existing in the library. This is a second level 
author, who will not create information, but invent new narrations and presen- 
tations. This task can be achieved either by creating new node schemas, or by 
re-using schemas of different hyperbooks. As we have already seen, each node 
schema can be applied to any hyperbook. As a consequence, a second level au- 
thor can create new schemas that include or refer to existing schemas, without 
having to modify the latter. 



5 Conclusion and Future Work 

In this paper, we have presented a virtual hyperbook model and an ontology 
integration approach adapted to the reading of hyperbooks in a digital library 
environment. Each hyperbook has its own domain ontology and hypertext interface 
specification. Our approach to reading “globally” in a library of virtual 
hyperbooks is based on the idea that each hyperbook corresponds to a point 
of view on a domain. By applying integration techniques to the hyperbooks’ 
ontologies, we can create a multi-point-of-view ontology that describes a set of 
hyperbooks. A hypertext interface specification language can use this ontology 
to infer semantic relations between informational fragments and to construct 
new semantically and narratively coherent documents that are based on the 
content of several hyperbooks. Thus, a user will read something that is more 
than each individual book in the library. We have created various implementations 
of hyperbooks using techniques of hypertext views on databases. The 
domain ontology, the hyperbook ontology and fragments are stored in a relational 
database (the class diagram of Fig. 3 can be readily translated into a 
relational database schema). The node schemas of the interface are specified in 
the Lazy language, which is a declarative language to specify hypertext views on 
relational databases [10]. This language corresponds to the node schema specification 
language presented in this paper. With this technology, we have developed 
a hyperbook management system for courses. Every course has its own hyperbook. 
The reading interface provides the user with different views to help him or 
her grasp the meaning of concepts and see the direct or indirect interconnections 
between the course’s concepts. We are currently using the integration techniques 
described here to define global reading interfaces for these hyperbooks. We plan 
to implement different similarity measures and compare them on hyperbooks in 
different domains. 

Abstract. Digital libraries are increasingly based on digital page images, 
but techniques for constructing usable versions of these page images 
are largely folklore. This paper documents some issues encountered in 
creating various kinds of renderings of page images for the UpLib digital 
library system, and suggests approaches for each, based on both prob- 
lem analysis and user feedback. Several factors important in determining 
useful sizes for small visual representations of the documents, called doc- 
ument icons, are discussed; one algorithm, called log-area, seems most 
effective. 



1 Introduction 

The UpLib personal digital library system provides a secure long-term storage 
and retrieval system for a wide variety of personal documents such as papers, 
photos, books, clippings, and email. It is suitable for collections comprising tens 
of thousands of documents, and provides for ease of document entry and access 
as well as high levels of security and privacy. It is highly extensible through user 
scripting, and is also intended to be useful as a platform for further research into 
digital libraries and computer-augmented reading. The general architecture and 
design of UpLib are more fully described in [5] and [6]. 

UpLib creates a searchable repository accessed through an active agent via 
a Web interface. This interface is highly visual, displaying documents as docu- 
ment icons (figure 1), laid out in a two-dimensional space, for the user to select 
from. The built-in reader application also uses thumbnail images of each page 
(figure 5), in two sizes, for various purposes, but primarily for reading. Other 
applications, such as the corpus browser discussed in [2], also use the thumb- 
nails in their interfaces. In the process of designing these interfaces for UpLib, 
a number of issues arose regarding effective generation of document icons and 
page thumbnails. The rest of this paper presents these issues, and discusses our 
approach to resolving them. 



R. Heery and L. Lyon (Eds.): ECDL 2004, LNCS 3232, pp. 111-121, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004 

Fig. 1. Document icons in an UpLib overview. These icons are generated using a 
constant-area algorithm, which allocates the same amount of area to each thumbnail, 
regardless of orientation. In this interface, clicking on an icon opens that document in 
a reader program. 



2 Document Icons 

In the user interfaces of many digital library systems, when a selection of docu- 
ments is shown to the user, it is presented as lines of text [12], [11]. Sometimes 
these lines contain document titles, sometimes they contain descriptions of the 
documents, sometimes they include information such as document size or owner 
or creation date. In many systems, these description lines also include small 
icons, typically indicating the document format or genre, such as “folder” or 
“Word document” or “web page”. This is also the presentation system used on 
Windows and MacOS desktops, for Google search results, and for many other 
multi-document systems. The SOMLib system uses generic icons of books, col- 
ored to indicate genre, but otherwise connected to the content of each icon only 
with textual labels on the icons [8]. 

In the UpLib system, however, the primary display mode for a selection of 
documents is as a set of graphical document icons, as shown in figure 1. These 
icons are not genre icons; rather, they are intended to remind the user of the 
source document, in terms of appearance, shape, and size. They are small but 
compelling visual representations of the documents they represent. In contrast 
to more general-purpose digital library systems, UpLib is intended for personal 
use by an individual; almost all the documents in any collection have already 
been seen (and often handled) by the user of the system. This reinforces the abil- 
ity of visual representations to remind the user of the content of the associated 
document. The use of document icons also capitalizes on the human perceptual 
preference for pictures over words when locating an item amidst a number of 
other similar items [7]. 

2.1 Computing Document Icon Size 

To visually represent the physical document in the digital space, it is usually 
necessary to scale the size of the document page to a thumbnail representation. 
It is customary to anti-alias the scaled image, and to preserve the aspect ratio of 
the image. However, it is less clear what an appropriate size for the icon should 
be. 

In UpLib, page thumbnails are generated for each page of a document (see 
section 3.1). These are typically constrained to fit in a particular rectangular 
region, as part of the reading system shown in figure 5. In the first implementation 
of UpLib, no special document icons were generated; the page thumbnail for 
the first page of the document was used as its iconic representation. This posed 
some interesting problems, and led to a series of algorithms for determining icon 
sizes. 

Figure 2 shows seven representative documents arranged in five rows, each 
row illustrating a particular icon sizing algorithm. From left to right, the docu- 
ments are a scanned store receipt, actual size 63.8x82.8 mm; an A4 technical pa- 
per, actual size 210x297 mm; a US-letter saved Web page, actual size 216x279.4 
mm; a Powerpoint presentation, actual size 279.4x216 mm; a map, actual size 
355.6x279.4 mm; a scanned newspaper clipping, actual size 99.1x222.7 mm; and 
a photograph, actual size 162.6x121.9 mm. The top row of the figure shows the 
effects of the original algorithm, using page thumbnails. While this algorithm 
worked well for the pages of a single document, all of which were the same size, 
it does not perform particularly well for a juxtaposed assortment of documents 
of different sizes. In particular, landscape-oriented documents were shortchanged 
in the display, when arranged near portrait-oriented documents. 

The second row of the figure shows our first attempt at rectifying this prob- 
lem. A special “document icon” was generated for each document, instead of 
simply using the thumbnail of the first page. Instead of using a portrait-oriented 
rectangle to size the icon, we use a square. You will note that the Powerpoint 
document now receives as much display area as the paper to its left. 

However, this algorithm fails to preserve some salient physical differences. 
Users complained that they were unable to distinguish A4 papers from US-Letter 
papers, even though the A4 document icon shown is somewhat “thinner” 
than the US-Letter icon. Apparently, the memorable characteristic of the A4 
page size is that it is somewhat “taller” or longer than US-Letter; our iconic 
representation did not preserve that relationship. 

Fig. 2. This figure illustrates five possible icon size generation strategies. At the top is 
constant-rectangle, which attempts to scale each icon to fit in a constant-size rectangle. 
The second row illustrates constant-square, a version of constant-rectangle which 
provides parity for landscape and portrait mode documents. Both of these suffer from 
the inability to quickly distinguish US-Letter from A4 documents by their primary 
distinguishing characteristic, height. This problem is addressed by the constant-area 
algorithm shown on the third row, which allocates the same amount of area to each 
document icon. However, this algorithm tends to distort the relative sizes of documents. 
The fourth row illustrates the linear algorithm, in which each icon has the same size 
relative to the others as the physical document would. With this algorithm, large documents 
such as maps, posters, or blueprints dominate the display. Finally, the log-area 
algorithm in the bottom row provides the same amount of area as the linear algorithm 
for a US-Letter document, but allocates more area (about four times more) to the 
small receipt, and somewhat less area to the large map (about two-thirds of the linear 
algorithm). 

The third row of figure 2 illustrates a different algorithm that gives each 
document icon an equal amount of display area, while preserving the document’s 
aspect ratio. The A4 document can be clearly distinguished from the US-Letter 
document by height. However, this still masks other relative differences in size. 
The small sales receipt and small photograph seem to be as large as the US- 
Letter document. The large map seems to be the same size as the Powerpoint 
presentation, while the relatively small newspaper clipping towers above the 
other icons. 

To address these issues, we looked for a sizing algorithm that would preserve 
some elements of the relative sizes of the documents. A strict linear reduction 
in size, shown in the fourth row of figure 2, would be problematic, as it would 
tend to make very large icons for very large documents, and vanishingly small 
ones for very small documents. We decided to use a smooth non-linear function 
of the area of a document to calculate the icon size. To increase the size of small 
documents, and reduce the size of very large ones, we chose a function relatively 
linear around the area of a US-Letter document, which would still reveal small 
differences, such as those between A4 and US-Letter, but which would still result 
in smaller icons for smaller documents, and larger ones for larger documents. 




Fig. 3. The linear and log-area scaling functions. 



The graph in figure 3 shows this function, which has the following formula: 



\[
\mathrm{factor} = \sqrt{\ln\!\left(\frac{\mathrm{area}_{d}}{\mathrm{area}_{\text{US-Letter}}} + 1\right)} \tag{1}
\]

To calculate the amount of area to allot to each icon, this factor is multiplied by 
the amount of area that would be given to the icon for a US-Letter document. 
The bottom row of figure 2 shows the result of sizing the document icons in this 
manner, which we call log-area. While this scaling algorithm seems preferable to 
the others, all five are available in the current UpLib system, and user-selectable 
via the configuration mechanism. 
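
The five sizing strategies can be compared with a small sketch. It is not UpLib's code: the function names, the 60x80 bounding box and the 6000-square-pixel area for a US-Letter icon are hypothetical values, and the log-area factor is assumed to be sqrt(ln(area_d / area_US-Letter + 1)), which is only one reading of formula (1).

```python
import math

US_LETTER_MM = (216.0, 279.4)


def fit_in_box(w, h, box_w, box_h):
    """Scale (w, h) to fit inside the box, preserving the aspect ratio."""
    s = min(box_w / w, box_h / h)
    return w * s, h * s


def with_area(w, h, area):
    """Scale (w, h) to cover the given area, preserving the aspect ratio."""
    s = math.sqrt(area / (w * h))
    return w * s, h * s


def icon_size(w_mm, h_mm, strategy, letter_icon_area=6000.0, box=(60, 80)):
    """Icon (width, height) in pixels for a document of the given size.

    letter_icon_area and box are hypothetical configuration values.
    """
    ratio = (w_mm * h_mm) / (US_LETTER_MM[0] * US_LETTER_MM[1])
    if strategy == "constant-rectangle":
        return fit_in_box(w_mm, h_mm, box[0], box[1])
    if strategy == "constant-square":
        side = max(box)
        return fit_in_box(w_mm, h_mm, side, side)
    if strategy == "constant-area":
        return with_area(w_mm, h_mm, letter_icon_area)
    if strategy == "linear":
        return with_area(w_mm, h_mm, letter_icon_area * ratio)
    if strategy == "log-area":
        # Assumed reading of formula (1); the factor scales the icon area.
        factor = math.sqrt(math.log(ratio + 1.0))
        return with_area(w_mm, h_mm, letter_icon_area * factor)
    raise ValueError(f"unknown strategy: {strategy}")
```

Under these assumptions, constant-square gives A4 and US-Letter icons the same height, constant-area makes the A4 icon taller, and log-area gives the receipt more area and the map less area than linear does.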






William C. Janssen 



2.2 Document Icon Decoration 

Enhancement of document icons used in task-oriented applications has been examined 
by Woodruff et al. [13]. That application constructed custom document 
icons to display search engine results to users. Each icon was a rendering of the 
result Web page, and text items that matched the query terms were exaggerated. 
This differs from the use of icons in UpLib in several ways: the rendering was 
only one of many possible renderings of the same Web page, the user may never 
have encountered that page before, and the text labels exaggerated on the page 
were specific to that search. However, the general idea of using exaggerated text 
on document icons to improve recognizability seems a useful one. 






Fig. 4. Document icons with labels. 



UpLib document icons support colored text labelling, in a deliberately limited 
fashion. Users can add multiline colored labels to document icons using a single 
size of a single font (chosen to be unlikely to appear similar to fonts in common 
use), by defining the “document-icon-legend” metadata element. Figure 4 shows 
an example of this type of document icon. This is particularly useful for technical 
papers using small fonts, email messages, newspaper web pages, or any other 
document that follows a standardized layout so that all documents in that layout 
appear quite similar. 

This approach can also be extended to non-text labels. The rightmost icon in 
figure 4 shows a mockup of an icon for an email message in an UpLib repository. 
A picture of the sender has been located in a “biff” library of email senders, and 
pasted into a whitespace area of the original page image. In addition, a label 
has been automatically calculated for the mail message from information in the 
mail headers. 

E-mail raises some questions about the appropriate granularity of documents. 
Is it a good idea to make each mail message a separate document, or is it more 
useful to compose a single document from all of the messages that make up a 
thread in an email discussion? If so, an appropriate document icon may consist 
of a graph of the thread tree, rather than a picture of the first sender. Simi- 
larly, a related series of photographs might be stored as a single document, with 
concomitant considerations for the document icon. A document similar to the 
WebBook proposed by Card et al. [4] could be built from a set of Web pages 
traversed in a single browsing session; an icon for such a book might be a graph 
of the session, similar to the “Web behavior graphs” discussed in [3]. The Web 
caricatures work [14] developed a feature analysis of a Web page, then used that 
analysis to drive construction of an iconic representation for that page. 

Other icon decorations are also possible. The DocuWorks system [1] 
incorporates cartoon decorations related to the desktop metaphor. Multi-page 
documents are distinguished from single-page documents with a small binder 
clip cartoon in the upper left corner. Multiple documents “bound” together are 
distinguished from individual documents by the addition of the cartoon of a 
spiral binding on the left side of the document. These decorations typically act 
as active controls; for example, the binder clip has left and right arrows that 
allow you to page through the page thumbnails of the document, and clicking 
the clip itself will fan out the pages of the document. 

3 Page Images 

The UpLib document reading subsystem makes certain assumptions about the 
economics of the computing environment. It assumes that disk storage on the 
order of gigabytes is very cheap (though not free, due to the overhead cost 
of backups); that the average communications speed is relatively fast, at least 
802.11b speeds; and that display screens are fairly tall, at least 1024 pixels in 
height. A tablet-PC, for example, in portrait mode has a screen that is at least 
768 pixels wide and 1024 pixels high. These assumptions are partially due to Up- 
Lib’s design for personal use: a personal library will have fewer documents than a 
community library, reducing storage requirements; the document repository will 
frequently reside on the machine the user is using it on, reducing communication 
overhead. 

As a result of this calculation, the reading subsystem uses page images as 
the primary presentation form of the document (figure 5). These are anti-aliased 
reduced-resolution versions of the document’s page images, sized to fit on the 
screen, and optimized for reading. In addition, a small thumbnail version of each 
page is generated, for use in document overviews. This thumbnail also contains 
an oversized page number for that page. This section discusses appropriate gen- 
eration of these two types of document page images. 

3.1 Page Thumbnails 

Small page thumbnails are used to show an overview of the pages of the docu- 
ment. Examples of this usage are shown in figures 5 and 6. They allow a user to 
locate graphically distinctive pages, containing diagrams, maps, or photographs, 
easily. They can also be used to provide some context for the particular page 
the reader is on. They are sized to be significantly smaller than document icons. 
This allows more of them to be presented to the reader without scrolling. For 
most slideshows or technical papers, all of the document’s page thumbnails can 
be presented without scrolling in a display such as that shown in figure 6. 

Fig. 5. An “open” document in an UpLib viewer. The small page icons on the left-hand 
side provide direct access to that page; when used on the 768 x 1024 pixel screen of 
a Tablet PC, they are partially occluded on the right side, but the page numbers are 
still visible. 




Fig. 6. The small page thumbnails are used for overviews of a document, as shown here 
and in figure 5. The highlighted thumbnail shows that the current page is page 22. Gross 
graphical detail on other pages is discernible. 



Each page thumbnail is numbered at the top left of the image. In a frame-based 
display such as that shown in figure 5, this allows the frame to be dragged 
to the left, occluding the right side of the page thumbnail, but providing more 
space for the main page image, and still showing the page number on the page 
thumbnail. 

3.2 Large Images 

UpLib uses “large thumbnails” in its primary reading interface. These are anti- 
aliased versions of the original high-resolution page images, scaled to fit on a 
typical display. The actual display size is user-configurable; by default, each 
large thumbnail is constrained to fit in a 680x880 pixel rectangle. This allows 
display on the 768x1024 pixel screen of a tablet-PC in portrait mode. The image 
format is also important for readability. A text page scaled to size and stored as 
either PNG (compression level 8) or JPEG (quality 75%) will be about the same 
size, but the JPEG version will exhibit ghosting effects around the characters, 
making it harder to read. 

Human response times must be considered for Web-based user interfaces. 
Typically, actions that complete within about 100 msec are seen as instanta- 
neous, and users become impatient if actions do not complete within about 
700-1000 msec (see [10], [9]). If we assume a transfer rate of about 2Mbps (a rea- 
sonable average for 802.11b), and a maximum allowable user delay of 700 msec, 
the page image should be no larger than about 170 KB. Allowing for decom- 
pression overhead, a top size of about 150 KB is a good target. Faster transfer 
speeds and user-side caching can be used to increase this limit. 
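
The size budget follows directly from the transfer rate and the tolerable delay. The short sketch below reproduces the arithmetic; the function name is hypothetical.

```python
def max_image_bytes(rate_bits_per_s, max_delay_s):
    """Largest payload, in bytes, that can be transferred within the delay."""
    return rate_bits_per_s * max_delay_s / 8


# 2 Mbps for 700 msec: 1.4 Mbit, i.e. about 175,000 bytes (roughly 171 KiB).
budget_bytes = max_image_bytes(2_000_000, 0.7)
budget_kib = budget_bytes / 1024
```

The same function shows how the limit moves with the link: at 1 Mbps the budget halves, which is why faster links or client-side caching relax the constraint.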

Fig. 7. A page from a paper, before and after whitespace cropping. Both versions are 
scaled to fit in a 680x880 pixel rectangle, but the one on the right is significantly more 
readable. 

Another concern is the relationship of the presented document size to the 
original document size. Our original implementation scaled each document to 
fit within a fixed 680x880 rectangle. This had the unforeseen effect of blowing 
up small documents, such as the cash register receipt shown in the first column 
of figure 2, to very large scales, so large that pixelization of the image impaired 
its readability. In addition, users had difficulty recognizing it for what it was, 
since they were used to the small size of the physical artifact. To counteract this 
effect, a sizing governor was used to limit the maximum size for a document. This 
governor assumes a display resolution of about 100 ppi, and scales documents 
for display at that resolution, if possible, or for a lower resolution if necessary. 
This means that the large thumbnail of a document 5cm on a side, scanned at 
300dpi, would be only about 197 pixels on a side rather than 680 pixels; that is, 
it would be rendered at about its normal size on a 100 ppi display. 
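
A sizing governor of this kind might look like the following sketch. The function name and signature are hypothetical; the 680x880 box and the 100 ppi cap are the values given in the text.

```python
def large_thumbnail_size(w_px, h_px, scan_dpi, box=(680, 880), display_ppi=100):
    """Pixel size of the large thumbnail for a scanned page.

    The page is rendered at display_ppi when it fits in the box at that
    resolution, and at a lower resolution otherwise.
    """
    w_in, h_in = w_px / scan_dpi, h_px / scan_dpi
    ppi = min(display_ppi, box[0] / w_in, box[1] / h_in)
    return round(w_in * ppi), round(h_in * ppi)


# A 5 cm square scanned at 300 dpi is about 591 pixels on a side.
small = large_thumbnail_size(591, 591, 300)      # (197, 197)
# A US-Letter page at 300 dpi is 2550 x 3300 pixels and is reduced to 80 ppi.
letter = large_thumbnail_size(2550, 3300, 300)   # (680, 880)
```

The cap only ever shrinks the output, so small documents keep roughly their physical size while large ones still fill the bounding rectangle.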

Larger documents are of course reduced more to fit in the bounding rectangle. 
For a letter-size document, there are about 80 ppi to work with. For 9-point 
text, this is about 10 pixels per line. With clean text, this is often good enough. 
However, since ppi is a ratio of total pixels to document size, it is possible to 
increase the apparent resolution not only by increasing the number of pixels, but 
also by decreasing the document size. We take the latter approach, by trimming 
excess background-color space around the text of the document. Figure 7 shows 
the results of this approach on a document page. It is important, when computing 
this type of cropping, to compute a single cropping box that will work across 
all the pages of the document, and apply this box to all pages uniformly. If 
the cropping box is instead determined and applied on a page-by-page basis, 
flipping from one page to the next may resize or recenter the text, causing minor 
perceptual dissonance which can impair reading. 
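
The uniform cropping box can be computed as the union of the per-page content boxes. The sketch below works on pages represented as lists of rows of grey values; that representation, the exact-match background test and the function names are assumptions for illustration, and a real implementation would need some tolerance for scanner noise.

```python
def content_bbox(page, background=255):
    """Bounding box (left, top, right, bottom) of non-background pixels.

    Returns None for a page that contains only background.
    """
    rows = [y for y, row in enumerate(page) if any(p != background for p in row)]
    cols = [x for row in page for x, p in enumerate(row) if p != background]
    if not rows:
        return None
    return min(cols), rows[0], max(cols) + 1, rows[-1] + 1


def document_crop_box(pages, background=255):
    """Single crop box covering the content of every page of the document."""
    boxes = [b for b in (content_bbox(p, background) for p in pages) if b]
    if not boxes:
        return None
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def crop(page, box):
    """Apply the same box to a page, so every page keeps the same geometry."""
    left, top, right, bottom = box
    return [row[left:right] for row in page[top:bottom]]
```

Since every page is cut with the same box, the cropped pages all have the same dimensions and the text does not shift when the reader turns a page.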

4 Conclusion 

With recent increases in storage capacity and network bandwidth, digital library 
systems based on page images, either scanned or generated, are becoming more 
popular. It is possible to optimize iconic representations of these page images 
for visual search and retrieval purposes, using the techniques outlined above. We 
recommend the log-area algorithm for producing appropriately scaled document 
icons for selection from a set of documents with heterogeneous sizes. Addition- 
ally, scaled page images can be used for document reading on tablet-PC-sized 
display surfaces, if appropriate attention is paid to size issues. In particular, the 
technique of removing page borders to increase effective resolution seems to work 
quite well. 

Abstract. The Digital Library (DL) field is one of the most promising areas of 
application for information visualization technology. In this paper, we propose 
a visual user interface tool kit for digital libraries, to deliver an overview of 
document sets, with support for interactive direct manipulation. Our system, 
Citiviz, employs a dynamic hyperbolic tree to display hierarchical relationships 
among documents, based on where their topics fit into the ACM classification 
system. Also, Citiviz provides an interactive, animated 2-dimensional scatter 
plot. With it, users may gain insight by changing various parameters, or may di- 
rectly jump to a particular document based on its label or location. According to 
a preliminary evaluation, our system shows advantages in performance and user 
preference relative to traditional text based DL web interfaces. 



1 Introduction 

The Computing and Information Technology Interactive Digital Educational Library 
(CITIDEL, http://www.citidel.org), part of the NSDL (National Science Digital Li- 
brary, http://www.nsdl.org), uses OAI-PMH (the Open Archives Initiative Protocol 
for Metadata Harvesting) to harvest resource metadata from its member collections. 
Those member collections are other digital libraries (DLs) that share their resources 
with CITIDEL, which provides integrated browsing and searching services. Users can 
browse separately through each member collection, or can browse through the union 
collection using any of four different classification schemes. Nevertheless, the pri- 
mary means to access CITIDEL is through searching. Unfortunately, if users are un- 
familiar with the topic of their search, or lack experience regarding search tactics, 
relevant documents may only appear frustratingly far down in a ranked list of search 
results. Fortunately, visual interfaces to DLs apply powerful data analysis and infor- 
mation visualization techniques to generate visualizations of document collections in 
DLs, with possible beneficial effect on browsing and searching. Thus, we have inte- 
grated text mining and information visualization to develop a visual interface to 
CITIDEL. 

Visualization techniques of one broad category consider predefined document at- 
tributes, such as author or date, and query relevance. One example is the Envision 
interface [14, 22]. It can organize search results according to metadata along the X 
and Y-axes, and show values for attributes associated with retrieved documents 
within each cell. However, the view provided by the original version of the Envision 
interface gave few cues about how the documents are related to each other in terms of 
their content and meaning. 

Visualization techniques of another category do not make assumptions regarding 
document attributes. They automatically derive a collection overview through unsu- 
pervised learning, which usually is based on inter-document similarities. Scat- 
ter/Gather [3, 7] is such a system that applied document clustering approaches to 
browsing and searching. However, the representation of document clusters by Scat- 
ter/Gather is textual, not graphical. 

Reflecting upon the above two different types of visualization techniques has led 
us to the following research questions: 

1. How should we combine the two different types of visualization techniques to 
develop a visual interface to CITIDEL for post-retrieval analysis? 

2. What text mining technology should we use to explore the inter-document simi- 
larities for online document collections that are dynamically created, such as the 
set of retrieved documents from a search engine? 

3. What are the insights supported, and how are they supported? 

4. What interaction and navigation strategies should we use to facilitate visual 
browsing and analysis? 

To address the above questions, we 

1. Developed clustering components to discover document relationships and to 
identify subject categories for retrieved documents. 

2. Developed a visual interface, called Citiviz (http://feathers.dlib.vt.edu/CitiViz/), 
for post-retrieval analysis, initially for CITIDEL, following the guiding princi- 
ples of Resnikoff [18] and Shneiderman [20]. Resnikoff observed that the human 
eye and other biological systems process the vast amounts of information avail- 
able in the real world by smoothly integrating a focused view, for details, with a 
general view, for context. Shneiderman advocated an interaction model in which 
the user begins with an overview of the information to be worked with, then pans 
and zooms to find areas of potential interest, and then views details. The 
following are the interaction and navigation methods we implemented: 

■ Use aggregation by document clustering as an overview strategy. 

■ Use the “focus + context” (fisheye) scheme to visualize a hierarchical graph of 
a concept map representing subject categories of retrieved documents. 

■ Combine tree graphs with scatter plot graphs. Documents attached to nodes of 
a tree graph can be visualized in a 2D space. 

■ Integrate a 2D scatter plot graph with a network of citations. Documents of se- 
lected clusters are scatter plotted in a 2D space and connected by citation rela- 
tionships. 

■ Apply the aggregate towers technique [16] to solve occlusion problems of 
documents visualized in the scatter plot graph. 



2 Related Works 

Visual interfaces to DLs apply powerful data analysis and information visualization 
techniques to manage document collections in DLs. They exploit human vision and 
spatial cognition to help humans mentally organize and electronically access and 
manage large, complex information spaces [1]. They have common usage scenarios 
supporting searching and browsing for DLs. Further, visualization of search results 
has much in common with gaining an overview of the coverage of a DL to facilitate 
browsing. Both enable the user to become oriented, and to find relevant information. 
They differ mainly in two respects. First is the origin of the document sets (a pre- 
existing static collection, or result set dynamically retrieved from a search engine). 
Second is the information available that relates documents to user information needs. 

Thus, first, we consider visualization based on predefined document attributes such 
as author or date, along with query relevance. In Section 1 we discussed Envision [14, 
22]. Here we broaden the discussion to include semantic information. Cougar [5] and 
Cat-a-Cone [6] display semantic information (categories assigned to each document) 
to users. Categories also can be visualized as a Hyperbolic tree [12] or a SpaceTree 
[15] as well as through a traditional node-link representation of a tree. Cat-a-Cone 
used ConeTree [19] to display the category labels of the documents retrieved, while 
the retrieved documents are organized as pages in a WebBook [2]. Another example 
is Map.net (http://map.net/start). It provides hierarchical (multilevel/categorical) in- 
formation maps for browsing over two million Web sites from the Open Directory 
Project (http://dmoz.com). Rather than using conventional search engine technology 
to navigate the Web, it creates a landscape that spatially represents data relationships, 
though in a very abstract, geometric fashion. Size and position of areas on the map 
indicate number of documents in respective categories and mutual relations between 
them. Users of this kind of interface gain an immediate overview of available catego- 
ries and the number of documents these categories contain. 

Document-query relevance was visualized in TileBars [4] and VIBE [11]. TileBars 
showed how query terms appear within individual documents, while VIBE displayed 
an overview of the retrieved documents according to which subset of query terms they 
contain. 

Often there are more than two predefined document attributes. Visualizing multi- 
attribute sets can be seen as visualizing multidimensional data sets. Techniques for 
visualizing multidimensional data include pixel-oriented, geometric projection, 
icon-based, hierarchical, and graph-based techniques [9]. The basic idea of pixel-oriented 
techniques is to map each data item to a colored pixel, while icon-based techniques 
map each data item to an icon. A well-known representative of hierarchical tech- 
niques is Treemaps [8]. Graph-based techniques effectively present a large graph 
using specific layout algorithms, query languages, and abstraction techniques. 

Visualization techniques in the second category introduced at the start of this sec- 
tion do not make assumptions regarding document attributes. They automatically 
derive a collection overview via the use of text mining, often through document clustering 
or neural networks. Examples are Scatter/Gather [3, 7], Grouper [24-26], Galaxy 
of News [17], Vivisimo (http://vivisimo.com), Kartoo (http://kartoo.com), Highlight 
(http://highlight.njit.edu/technology.htm), SOM [10, 13], ThemeScapes [23], and 
Mooter (http://mooter.com:8080/moot). 

Occlusion is one of the important issues in information visualization. The Envision 
system [14, 22] solves this problem by using a flexible table that resizes its cells ap- 
propriately. On the other hand, the aggregate tower technique [16] avoids occlusion of 
objects by stacking objects together, creating towers of objects. 

Grouper was a dynamic clustering interface to web search results. It introduced the 
Suffix Tree Clustering (STC) algorithm. Vivisimo is a web search clustering interface. 
Its algorithm is based on an old artificial intelligence idea: a good cluster or 
document grouping is one that possesses a good, readable description. Kartoo is a web 
interface that organizes search results retrieved from relevant web search engines by 
topic and displays them on a 2-dimensional map. Conceptually, Kartoo provides a 
node-link graph. A document (Web page) node is represented by a ball. The size of the 
ball corresponds to the relevance of the document to the query. Links are labeled with 
sets of keywords shared by related documents. Another example of a visualization 
technique in this category is the self-organizing map (SOM). SOM is a neural network 
algorithm that takes a set of high-dimensional data and maps them onto nodes in a 2D 
grid. Shifting to 3D, the ThemeScapes view imposes a three-dimensional representa- 
tion on the results of clustering. The layout makes use of “negative space” to help 
emphasize the areas of concentration where the clusters occur. 

Combining visualization with text mining could lead to novel discovery tools [21]. 
Examples are commercial tools such as SAS JMP (http://www.sas.com), Spotfire 
(http://www.spotfire.com), and SPSS Diamond (http://www.spss.com). 



3 System Design 

To address the research questions raised in Section 1, and building upon related work 
(Section 2) and our prior work with CITIDEL and Envision, we have developed 
Citiviz according to a component-based design. Communication between components 
is XML-based. There are three types of components: Data Source Components, 
Clustering Components, and Visualizing Components. The first two were implemented 
and then wrapped into Java servlets to enable web access. The Visualizing 
Components, also implemented in Java, communicate with those servlets in XML. 



3.1 Data Source Components 

Data Source Components send queries to CITIDEL or other DLs, and parse the re- 
trieved HTML pages into XML files, conforming to XML schemas we developed. 
Those XML files are then transmitted to the Clustering Components for processing. 
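The hand-off from a Data Source Component to a Clustering Component can be pictured with a small sketch. It is written in Python rather than the Java used by Citiviz, and the element names (`results`, `document`, `title`, and so on) and record fields are assumptions made for illustration; the actual Citiviz XML schemas are not reproduced in this paper.

```python
import xml.etree.ElementTree as ET

def records_to_xml(query, records):
    """Serialize parsed search results into an XML string.

    `records` is a list of dicts. The keys used here (id, title, author,
    year, rank) are hypothetical, not the schema used by Citiviz.
    """
    root = ET.Element("results", attrib={"query": query})
    for rec in records:
        doc = ET.SubElement(root, "document", attrib={"id": str(rec["id"])})
        for field in ("title", "author", "year", "rank"):
            child = ET.SubElement(doc, field)
            child.text = str(rec[field])
    return ET.tostring(root, encoding="unicode")

# Invented sample record standing in for one parsed HTML result.
sample = [{"id": 1, "title": "Data Compression", "author": "A. Author",
           "year": 1987, "rank": 1}]
xml_text = records_to_xml("Data Compression", sample)
```

A clustering servlet would then parse `xml_text` on its side; keeping the exchange format in XML is what lets a new DL be attached by writing only a new data source.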



3.2 Clustering Component 

Clustering Components are implementations of different document clustering algo- 
rithms. We developed a new clustering component to supplement the clustering com- 
ponents of Carrot2 [26] that have been incorporated into our system. 



3.3 Visualizing Component or Citiviz 

Citiviz applies two major visualizing techniques - a hyperbolic tree of a hierarchical 
concept map and a 2D scatter-plot graph. The initial interface is shown in Figure 1. 

Fig. 1. Initial Interface of Citiviz 

The top right of the screen is a hyperbolic tree based on the ACM Computing 
Classification System (1998 Version, CCS1998, http://www.acm.org/class/1998/). On 
the top left is a query box. By default, a user will retrieve results from a member DL 
(e.g., “ACM DL”) of CITIDEL. A user also has an option to retrieve results from all 
CITIDEL member DLs. In the middle right of the screen, there is a 2D scatter-plot. At 
the bottom right, there are fields for details of the attributes of a selected document. 
Citiviz supports exploring to gain insights, as is illustrated in the following three ex- 
ample scenarios. 

Examples of Insights Sought 

1. How are the retrieved documents clustered according to the ACM Computing Clas- 
sification System? 

2. How are the retrieved documents clustered according to inter-document similarity? 

3. Which cluster has the largest portion of the document collection? 

4. To what category does the 1st ranked document belong? 

5. Which document is cited most among the selected clusters of documents? 

6. Which documents cite a selected document? 

7. What’s the most recently published paper by a particular author? 

8. To what topics does a document belong? 

Scenario 1: Show Me the Retrieved Results from ACM DL 

A user inputs query “Information Visualization”. By default, Citiviz provides re- 
trieved results from the CITIDEL member DL named “ACM DL”. A hierarchical 
concept map organized according to the ACM Computing Classification System then 
is displayed as a hyperbolic tree on the top right of the screen. The node name repre- 
sents a category, and a bubble attached to a node represents a document collection 
belonging to that category. The size of a bubble attached to a node indicates the size 
of the document collection clustered in that category. The hyperbolic tree supports 
“focus + context” navigation. After the user clicks the “Show all data in the scatter plot” 
button, all the retrieved documents from ACM DL are plotted in the 2D scatter plot space, 
as shown in Figure 2. 




Fig. 2. Visual Results of Scenario 1 




Each document is visually mapped to a tower of cylinders (see Figure 3). Each level 
of a tower represents a cluster to which a document belongs. The taller a tower is, the 
more categories the document belongs to. Moreover, clicking on a tower allows users 
to see detailed information for the selected tower, as shown at the bottom of the 
screen (see Figure 2 and Figure 4). On the left of the screen, there is a list of colored 
bars representing the categories that those retrieved documents belong to. Clicking on 
a bar allows users to see a list of documents belonging to the cluster represented by 
the clicked bar. Moving the mouse over a bar invokes an animation of blinking towers 
in the 2D scatter plot space. Those blinking towers represent documents belonging to 
the category visually mapped to a colored bar selected with the mouse. Towers in the 
2D space can be arranged according to attributes of rank, date, and citations. The 
colors of the levels of a tower correspond to those categories to which a document 
belongs. A user can change the color of a bar to distinguish different categories. The 
color of a bar, the color of its corresponding level in all towers, and the color of its 
corresponding node in the hyperbolic tree are always synchronized. 
Fig. 4. Detailed information for selected document 



Scenario 2: Show Me Papers Related to “Algorithm Analysis” and Published by “Donald Knuth”, from CITIDEL 

A user inputs query “Donald Knuth”. She selects option “Search for all collections”. 
The retrieved results from CITIDEL then are clustered, using suitable components. 
After the clustering, results are displayed as a hyperbolic tree. She navigates the hy- 
perbolic tree and finds that a category named “Algorithm” is of interest. She then 
clicks the purple bubble attached to that category. This causes all five 
documents belonging to this cluster to be plotted as five purple, 1-level towers in the 
2D scatter plot space, as shown in Figure 5(a). She continues browsing the hyperbolic 
tree and finds another interesting category named “Analysis”. She clicks the magenta 
bubble attached to the category named “Analysis”. This new category contains nine 
documents. Since two papers belong to both the “Algorithm” and “Analysis” 
categories, the interface shows seven 1-level magenta towers and two 2-level 
towers, each consisting of one purple level and one magenta level, as shown in Figure 5(b), 
instead of adding nine new 1-level towers to the scatter plot. 
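The tower bookkeeping in Scenario 2 can be sketched in a few lines. This is only an illustration of the idea, not the Citiviz Java implementation; the function name `add_category` and the document ids are hypothetical.

```python
def add_category(towers, category, doc_ids):
    """Add one selected category to the scatter plot.

    `towers` maps a document id to the ordered list of categories (levels)
    in its tower. A document that is already plotted gains a level
    instead of getting a second tower.
    """
    for doc_id in doc_ids:
        towers.setdefault(doc_id, []).append(category)
    return towers

towers = {}
# Five documents in "Algorithm", nine in "Analysis", two of them shared.
algorithm_docs = ["d1", "d2", "d3", "d4", "d5"]
analysis_docs = ["d4", "d5"] + ["d%d" % i for i in range(6, 13)]
add_category(towers, "Algorithm", algorithm_docs)
add_category(towers, "Analysis", analysis_docs)

two_level = [d for d, levels in towers.items() if len(levels) == 2]
magenta_only = [d for d, levels in towers.items() if levels == ["Analysis"]]
```

With this sample data, `two_level` holds the two shared documents and `magenta_only` the seven documents that belong to “Analysis” alone, matching the counts in the scenario.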

Scenario 3: Show Me All Papers Related to “Data Compression” That Are Cited by This Paper 

A user inputs query “Data Compression”. By default, she gets retrieved results from 
the CITIDEL member DL “ACM DL”. After she clicks “Show all data in the scatter 
plot”, all the retrieved documents from ACM DL are scatter plotted in the 2D space. 
When she clicks a tower representing the document with title “Data Compression”, 
citation links pointing to other towers are dynamically displayed on demand as shown 
in Figure 6. A link connecting two towers indicates a citation relationship between the 
two papers. That is, the paper a link points to is cited by the paper the link starts from. She then follows 
the link to get detailed information for the cited papers. 
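The on-demand citation links of Scenario 3, and the related insight “which document is cited most among the selected clusters”, can be sketched as below. The data structure (a dict from a document id to the ids it cites) and both function names are assumptions for illustration; the paper does not describe how Citiviz stores citations.

```python
def citation_links(selected, citations, plotted):
    """Return (citing, cited) pairs to draw when `selected` is clicked.

    Only cited documents that are currently plotted receive a link.
    """
    return [(selected, cited)
            for cited in citations.get(selected, [])
            if cited in plotted]

def most_cited(citations, plotted):
    """Return the plotted document cited most often by plotted documents."""
    counts = {}
    for citing, cited_list in citations.items():
        if citing not in plotted:
            continue
        for cited in cited_list:
            if cited in plotted:
                counts[cited] = counts.get(cited, 0) + 1
    return max(counts, key=counts.get) if counts else None

# Invented sample data: p9 is cited but not among the plotted towers.
citations = {"p1": ["p2", "p3", "p9"], "p2": ["p3"]}
plotted = {"p1", "p2", "p3"}
links = citation_links("p1", citations, plotted)
top = most_cited(citations, plotted)
```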



4 Evaluation 

To evaluate the interface, we conducted a small usability study to suggest further 
improvements and to determine whether the combination of the hyperbolic tree 
and the scatter plot helps users find a document more easily and quickly than with 
traditional, web-based interfaces. Four Computer Science graduate students participated in 
this evaluation. 
Fig. 5. a) Top, a 1-level document is selected, b) Bottom, where results for two categories are 
shown, the document selected has 2 levels 



The test consisted of three sections, each designed to measure a different 
tool: Citiviz using ACM classification, Citiviz using the Citiviz clustering component, 
and CITIDEL (www.citidel.org). In each section, participants were asked to complete 
four tasks. The tasks were designed such that they could be completed using any of 
the tools. During the test session, the order of tool use was permuted to avoid bias. 
The participants were asked to perform each of the following tasks: 
Fig. 6. Scatter Plot shows Citations 

1. Given an author and a topic, find a document published by that author and 
belonging to that topic. 

2. Given an author and a publication year, find a document published by that au- 
thor and in that year. 

3. Given a title, find a document having that title. 

4. Find the most recently published paper. 

The results of the study are summarized in Figure 7. 

In the study, there is no significant difference between Citiviz (ACM classification) 
and Citiviz (Citiviz clustering) when users browse search results for a paper 
based on topic or title information (tasks 1 and 3). However, Citiviz (Citiviz clustering) 
helps users find a document faster than Citiviz (ACM classification) when users 
browse search results for a paper based on publication date (tasks 2 and 4). Based on 
our observations, the reason is that several users were confused by the concept of 
aggregate towers. As a result, it might be more difficult for users to identify 
documents in Citiviz (ACM classification), where documents usually are in towers 
consisting of several levels. 

In contrast to Citiviz using the ACM classification scheme, documents visualized 
using Citiviz's clustering component usually have one level. Thus, it is relatively easy 
for users to identify the publication date of a document. 

Unsurprisingly, Citiviz helps users find a document faster than CITIDEL, when 
users browse search results for a paper based on topic and publication date (tasks 1, 2, 
and 4). Citiviz is designed to visualize topic and publication date information graphi- 
cally by using a hyperbolic tree and a scatter plot. These features allow users to gain 
more insight about document relationships based on topic and publication date information. 
In contrast, CITIDEL displays this information textually and individually, so 
users cannot quickly see the relationships among documents. However, CITIDEL 
helps users find a document faster than Citiviz when users browse search results for a 
paper based on title (task 3) because, in contrast to Citiviz, CITIDEL displays search 
results as a list of titles. As a result, finding a paper with a certain title is quite easy in 
CITIDEL. Accordingly, in a future version of Citiviz, we will add features to better 
support this type of task. 

After users completed all tasks, they were asked to fill out questionnaires. It appears 
that users believe that Citiviz is easy to use and helps them find documents more easily 
and quickly than a traditional tool would. Users also think the scatter plot and the 
hyperbolic tree are helpful, although some users think that the hyperbolic tree for the 
ACM classification scheme is too big and too complex. 

The hyperbolic tree of the ACM classification scheme usually has three levels 
(depth-oriented). Even if users know the exact topic, it is still difficult to locate the topic in 
the hyperbolic tree. In contrast, the hyperbolic tree of the Citiviz clustering component 
usually has one level (breadth-oriented). If users know the exact topic, it is easy to 
locate the topic in the hyperbolic tree and to find the document. 




[Figure 7: bar chart comparing Citiviz - ACM Classification, Citiviz - Citiviz Clustering, 
and Citidel.org on four tasks. Task 1: Given an author and a topic, find the document. 
Task 2: Given an author and a publication year, find the document. Task 3: Given a title, 
find the document. Task 4: Find the most recent published paper.] 



Fig. 7. The results of the user study 



5 Conclusion 

The result of our work is a DL visual interface tool kit combining text mining and 
information visualization. It uses a 2D scatter plot to visualize document attributes 
(e.g., rank, date) as did Envision [14, 22]. Unlike Envision, the 2D scatter plot space 
also integrates a network of citations to show the document relationships. A further 
difference of our work from Envision is that we integrate document clustering and 
information visualization to give insight into similarity among documents as well 
as predefined document attributes. Though some approaches such as ThemeScapes 
[23] show the inter-document similarities, they display data in a completely flexible 
manner and do not provide an overview of document attributes. 
The visual interface provides overviews of retrieved results from CITIDEL. The 
overview strategy of aggregation by document clustering gives users insight into 
how similar documents are clustered. The overview of a hierarchical concept map 
displayed as a hyperbolic tree supports “focus + context” navigation. “Focus + con- 
text” navigation provides direct manipulation and high interaction, and therefore a 
balance of local detail and global context. The overview of document attributes such 
as query relevance shown in the 2D scatter plot space allows users to understand why 
a document is retrieved. Integrating the 2D scatter plot space with a network of citations 
shows users document citation relationships. All of these address the last two questions 
mentioned in Section 1. 

The componentized, XML-based architecture of our project makes the tool kit 
reusable for different DLs. The Data Source Component we developed provides a 
data source from CITIDEL, which serves as a portal to its member DLs such as the 
ACM DL. So, in addition to being a visual interface to CITIDEL, the result of our 
project also provides a visual interface to its member DLs. Connecting our tool kit 
to another DL can be done easily by implementing a Data Source 
Component for that DL. Accordingly, after some small improvements identified in 
this study are made to Citiviz, we plan to deploy it for larger scale testing with 
CITIDEL, CITIDEL-member DLs, and other DLs such as NDLTD (www.ndltd.org). 
Abstract. In maintaining Digital Libraries, having bibliographic data up-to-date 
is critical, yet often minor irregularities may cause information isolation. Unlike 
documents for which various kinds of unique ID systems exist (e.g., DOI, 
ISBN), other bibliographic entities such as author and publication venue do not 
have unique IDs. Therefore, in current Digital Libraries, tracking such biblio- 
graphic entities is not trivial. For instance, suppose a scholar changes her last 
name from A to B. Then, a user, searching for her publications under the new 
name B, cannot get old publications that appeared under A although they are by 
the same person. For such a scenario, since both A and B are the same person, it 
would be desirable for Digital Libraries to track their identities accordingly. In 
this paper, we investigate this problem known as name authority control, and 
present our system-oriented solution. We first identify three core building 
blocks that underlie the phenomenon, and present a taxonomy of the different 
combinations of the building blocks that can occur. Then, we consider how systems 
can address the problem in two common functions of Digital Libraries - Update 
and Search. Finally, we present our test-bed, called OpenDBLP, in which the 
suggested solution is fully implemented as a proof of concept. 



1 Introduction 

A bibliographic Digital Library (DL) such as DBLP [3], CiteSeer [9] or e-Print arXiv 
[10] archives a collection of articles and their citation data in a certain domain. Often, 
such a DL is a starting place for researchers to locate relevant works, and a good test- 
bed for various citation analysis studies. A citation or reference consists of various 
bibliographic fields (e.g., author, title, conference/journal name or year), which we 
refer to as Bibliographic Entity in this paper. Often, documents have ways to track 
their identity. For instance, similar to the case where ISBN can serve as a unique ID 
for books, Digital Object Identifier (DOI) [11] can provide a persistent ID for digital 
documents. Therefore, even if two citations are slightly different in format, if their 
associated DOIs are identical, then one knows two citations in fact refer to the same 
object in the real world. As DLs evolve over time, bibliographic entities change 
too. In particular, due to data-entry errors or different formats used in references, DLs 
often contain a large variety of values referring to the same objects. For instance, 
Figure 1, inspired by [14], is a screenshot of a search session in CiteSeer, looking 
for the book “Artificial Intelligence: A Modern Approach” by Russell and Norvig. 
Note that CiteSeer currently lists 23 citations as different, but all of them refer to 
the identical book and thus should be consolidated into a single citation. 
Fig. 1. Search results showing name authority control problem. 

This so-called Citation Matching problem (more generally known as Record Linkage 
[13] or Identity Uncertainty [14]) is mainly due to the different formats people use on 
the Web or in publications. To address this problem, people have devised various methods 
to automatically detect duplicate or similar bibliographic entities that refer to the 
same object (e.g., [13][14][8]). For instance, using Levenshtein edit distance or Jaro 
distance, one may detect that “S. Russell” and “Russell, Stuart” are the same person. 
However, to the best of our knowledge, there has been little work on how systems can 
support updating and searching duplicates once such matches are identified. Furthermore, 
previous research tends to focus on irregularities caused by errors (e.g., 
misspellings, data-entry errors). However, there are also “semantic” irregularities that are 
legitimate but unavoidable (e.g., a person changes last name after marriage). Therefore, 
in this paper, we investigate how to support maintenance of DLs once various 
semantic irregularities are identified. We especially focus on tracking two biblio- 
graphic entities - author and publication venue. First, let us consider various semantic 
irregularities of two entities that may occur in DLs: 

• Author: Since the identity of a person is often determined by the name, if a person's 
name changes over time, the system cannot keep track of the person's bibliographic 
records uniformly. For instance, when an author “Alon Levy” becomes “Alon 
Halevy” after marriage, DLs view the two names as different persons. Likewise, when 
two scholars share the same name, the system cannot differentiate them. For instance, 
DBLP views two “Wei Wang”s, one at U. North Carolina and the other at HKUST, 
both database researchers, as one person. Another case is that a DL cannot 
recognize different variants of a name. For instance, “Lee D. Coraor” and “Lee 
Coraor” are treated as two different persons although both refer to the same 
professor at Penn State. 

• Publication Venue: Similarly, the identity of a publication venue such as a 
conference, journal, or publisher is also determined by the name in DLs, but the name can be 
dynamic. For instance, multiple conferences may merge into a single conference 
over time (e.g., “ACM DL” and “IEEE ADL” merged into “ACM/IEEE JCDL” in 
2001), or conversely a single conference can split into multiple conferences (e.g., 
“ACL” and “COLING” merged into “ACL-COLING” in 1998, then separated 
afterward). Furthermore, the characteristics of a venue may change (e.g., a workshop 
“ML” has evolved into a conference “ICML”). 
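The edit-distance matching mentioned earlier in this section can be sketched as follows. The Levenshtein routine is the standard dynamic-programming formulation; the `name_similarity` helper and its length normalization are assumptions added for illustration, not a method taken from the cited work.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

def name_similarity(a, b):
    """Hypothetical score in [0, 1]: 1 minus the length-normalized distance."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest

score = name_similarity("Lee Coraor", "Lee D. Coraor")
```

A plain character-level distance handles variants such as a dropped middle initial, but reordered forms such as “Russell, Stuart” versus “S. Russell” would first need some token normalization, which is not shown here. It also cannot detect a semantic change such as a new last name, which is the case this paper is concerned with.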

To handle such semantic irregularities, one option is to run citation matching methods 
periodically. However, not only is the accuracy of such methods less than perfect, 
but rerunning them is also wasteful from a system point of view. Once a DL learns that 
“S. Russell” and “Russell, Stuart” are the same person, it is more desirable for the 
system to keep that knowledge and exploit it in the future. Similarly, after a DL learns 
that “ACM DL” and “IEEE ADL” were merged into “ACM/IEEE JCDL” in 2001, it may 
return publications from all three conferences for a query “find all publications in JCDL 
about Digital Identity after 1995” even though the query mentions only “JCDL.” 
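To make the idea of “keeping the knowledge” concrete, here is a minimal sketch in which learned venue relationships are stored once and reused to expand a query. The dictionary layout and the function name `expand_venue` are hypothetical and are not taken from OpenDBLP.

```python
# Hypothetical store of learned relationships: venue -> earlier venues.
VENUE_PREDECESSORS = {
    "ACM/IEEE JCDL": ["ACM DL", "IEEE ADL"],
    "ICML": ["ML"],
}

def expand_venue(venue, predecessors):
    """Return the venue followed by all of its (transitive) predecessors."""
    seen, stack = [], [venue]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(predecessors.get(current, []))
    return seen

jcdl_venues = expand_venue("ACM/IEEE JCDL", VENUE_PREDECESSORS)
```

A search for JCDL papers would then match any venue in `jcdl_venues`, so no citation matching has to be rerun at query time.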



2 Problem and Solution Overviews 

We consider the problem of how to track bibliographic entities in DLs, typically 
known as the Name Authority Control problem. Formally, we address the following problem: 

When bibliographic entities (i.e., author and publication venue) of citations 
change over time in Digital Libraries, devise system support such that DLs 
can update and search the changes properly. 

To address the problem, we first present three core elements - linear change, split, and 
merge - as basic building blocks, and discuss how systems can support them. More 
specifically, we discuss how the UPDATE and SEARCH functions of DLs are changed to 
track bibliographic entities. Then, we present a proof of concept, fully implemented 
in a test-bed called OpenDBLP [1]. 



3 Related Work 

The citation matching problem has been extensively investigated under various names in 
various disciplines. For instance, it bears great relevance to problems known as 
record linkage [13], identity uncertainty [14], merge-purge [2], etc. However, none of 
them concerns issues related to “system support” once matching citations are identified. 
Furthermore, we are interested in individual bibliographic entities - author and 
publication venue. The works in [4][8] aim at detecting name variants automatically 
using data mining or heuristic techniques. Our work is complementary to them since 
we focus on system support issues once such variants are (semi-)automatically identified. 
Similarly, [5] introduces a method to find matching variants of a named entity, such as 
a project name, in a given text (e.g., DBLP vs. Data Base and Logic Programming). 
DOIs [11] provide a means to specify a persistent ID for digital objects. However, their 
full acceptance is far from reality. Similarly, [6] discusses an effort to standardize 
author names using a unique number, called INSAN. [7] is a recent implementation 
of name authority control, called HoPEc, which bears some similarity to our approach. 
A detailed comparison of OpenDBLP with [6] and [7] is summarized in 
Table 1. 
Table 1. Comparison between OpenDBLP vs. INSAN [6], HoPEc [7]. 

                                        INSAN   HoPEc   OpenDBLP 
Support for unique ID in the system?    Yes     No      Yes 
Support for UPDATE?                     Yes     Yes     Yes 
Support for SEARCH?                     No      No      Yes 
Support for Linear Change?              Yes     Yes     Yes 
Support for Split?                      No      No      Yes 
Support for Merge?                      No      No      Yes 



4 Our Approach 

Although many variations seem to exist in name authority control issues, at 
bottom there are only three core elements, as follows: 

1. Linear change (A → B). A bibliographic entity A is changed to B. For instance, 
an author or a conference/journal name is changed over time. 

2. Split (A → {A1, A2}). A bibliographic entity is split into multiple ones. For instance, 
a conference can be broken into two, or the publications of two scholars 
whose names have the same spelling can be split. 

3. Merge ({A1, A2} → A). Conversely, multiple bibliographic entities are merged into 
one. For instance, two variants of a person’s name may be merged into one 
authoritative one. 

We first discuss how two common functions of DLs, Update and Search, can be 
changed to support the three core elements. 



4.1 UPDATE Function 

Once name variants are identified (manually by a librarian/author or automatically by 
algorithms), the findings must be inserted into DLs to resolve the name authority control 
problem. For instance, suppose an author “Wei Wang” realizes that her publication list is 
mixed with that of another person whose name has the same spelling as hers. Then, she may 
need to specify which of the publications in the DL belong to her and which do not. 
Or, if a librarian finds that publications under both “Lee Coraor” and “Lee D. Coraor” 
are by the same physical person, he probably wants to merge the two publication lists 
into one by informing the DL of all the name variants. 




Fig. 2. One possible ER diagram for bibliographic digital libraries. 




Imagine a DL built on an RDBMS with three tables shown in Figure 2: (1) the Publications 
table contains citations of each publication, except their author names, (2) the Persons 
table contains author-related information, and (3) PubPer(pubID, perID, ...) tells 
which publication is authored by which person. Since a publication can be authored 
by many co-authors together, to avoid a 1NF violation, a separate Persons table is 
needed. Also, note that there are placeholders to store the “alias” name variants, for 
both author name (authorAlias) and publication venue (venueAlias). 

1. Linear change. When an entity A is changed to B, the system sets A as an alias 
(either in the authorAlias or venueAlias column), and sets B as the current name 
(either in the author or venue column). Further, A’s publications are moved to B 
by changing the (pubID, perID) pairs in the PubPer table. 

2. Split. Suppose an entity A needs to be split into B and C. Then, the system needs 
to know not only the new names of A (i.e., B and C), but also which of A’s 
publications belong to B and which to C. Then, the system creates a new unique ID, 
perID, for each of B and C and moves their corresponding publications accordingly. 
Note that B and C can be the same name. For instance, when the publications 
of two “Wei Wang”s were originally mixed incorrectly, separating them is a 
case of split, as in Wei Wang → {Wei Wang, Wei Wang}. 

3. Merge. When name variants A and B are merged into C, the system sets both A 
and B as aliases of C. Since A, B, and C each still have a unique ID, perID, in the 
system, users can still search using the old name variants A and B when they 
want to. 
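The linear change and split updates above can be sketched against an in-memory SQLite database. This is a minimal illustration, not the OpenDBLP implementation: the table and column names follow the schema described above, but the column types, the function names, and the sample rows are assumptions.

```python
import sqlite3

def make_db():
    """Create a reduced version of the three-table schema (types assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Persons (perID INTEGER PRIMARY KEY,
                              author TEXT, authorAlias TEXT);
        CREATE TABLE Publications (pubID INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE PubPer (pubID INTEGER, perID INTEGER);
    """)
    return conn

def linear_change(conn, old_name, new_name):
    """A -> B: add B with A as its alias and move A's publications to B."""
    (old_id,) = conn.execute(
        "SELECT perID FROM Persons WHERE author = ?", (old_name,)).fetchone()
    cur = conn.execute(
        "INSERT INTO Persons (author, authorAlias) VALUES (?, ?)",
        (new_name, old_name))
    new_id = cur.lastrowid
    conn.execute("UPDATE PubPer SET perID = ? WHERE perID = ?",
                 (new_id, old_id))
    return new_id

def split(conn, old_id, assignment):
    """A -> {B, C}: `assignment` is a list of (name, [pubIDs]) pairs.

    A list is used rather than a dict because B and C may share a name.
    """
    new_ids = []
    for name, pub_ids in assignment:
        cur = conn.execute("INSERT INTO Persons (author) VALUES (?)", (name,))
        new_id = cur.lastrowid
        new_ids.append(new_id)
        for pub_id in pub_ids:
            conn.execute(
                "UPDATE PubPer SET perID = ? WHERE pubID = ? AND perID = ?",
                (new_id, pub_id, old_id))
    return new_ids

# Invented sample rows for demonstration only.
conn = make_db()
conn.execute("INSERT INTO Persons (perID, author) "
             "VALUES (1, 'Alon Levy'), (2, 'Wei Wang')")
conn.execute("INSERT INTO Publications VALUES "
             "(1, 'Paper A'), (2, 'Paper B'), (3, 'Paper C'), (4, 'Paper D')")
conn.execute("INSERT INTO PubPer VALUES (1, 1), (2, 1), (3, 2), (4, 2)")

halevy_id = linear_change(conn, "Alon Levy", "Alon Halevy")
wang_ids = split(conn, 2, [("Wei Wang", [3]), ("Wei Wang", [4])])
```

A merge would follow the same pattern: insert or pick the authoritative entry, then record the other variants as its aliases. Note that a single authorAlias column holds only one alias per row; how multiple aliases are stored is not specified here, so that part is left out of the sketch.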



4.2 SEARCH Function 

Once duplicates are identified and put into the system, that knowledge can be exploited 
in searching. Suppose a user is looking for all publications about XML by 
“Alon Halevy,” a database researcher at U. Washington, USA. When he submits the two 
words “Alon Halevy” in the search box of a DL, internally, an SQL query similar to 
the following (assuming the schema of Figure 2) will be issued: 

SELECT P1.*
FROM Publications P1, PubPer P2, Persons P3
WHERE P1.pubID = P2.pubID AND P2.perID = P3.perID AND
      P3.author = 'Alon Halevy' AND P1.title LIKE '%XML%'

However, what this user does not know is that the DL keeps a separate list of
publications by the same physical person, but under a different name, “Alon Levy”.
Once such related information has been recorded by the UPDATE function, the system
can return a merged list, display a link to the publication lists under related name
variants, etc. For instance, the following SQL query would return a merged list using
the “alias” columns:

(SELECT P1.*
 FROM Publications P1, PubPer P2, Persons P3
 WHERE P1.pubID = P2.pubID AND P2.perID = P3.perID AND
       P3.author = 'Alon Halevy' AND P1.title LIKE '%XML%')
UNION
(SELECT P1.*
 FROM Publications P1, PubPer P2, Persons P3, Persons P4
 WHERE P1.pubID = P2.pubID AND P2.perID = P4.perID AND
       P3.author = 'Alon Halevy' AND P1.title LIKE '%XML%' AND
       P3.authorAlias = P4.author)
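To see the effect of the alias join, the two queries can be run on a toy database. The sketch below assumes SQLite (which does not accept parentheses around the operands of UNION, so they are dropped), a reduced column set, and three made-up publication rows; none of the rows or the search helper come from DBLP or OpenDBLP.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE Publications (pubID INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE Persons (perID INTEGER PRIMARY KEY, author TEXT, authorAlias TEXT);
    CREATE TABLE PubPer (pubID INTEGER, perID INTEGER);
    -- Hypothetical rows: the same person is stored under two names.
    INSERT INTO Publications VALUES (1, 'Toy title about XML views'),
                                    (2, 'Toy title about XML queries'),
                                    (3, 'Toy title about data integration');
    INSERT INTO Persons VALUES (1, 'Alon Halevy', 'Alon Levy'), (2, 'Alon Levy', NULL);
    INSERT INTO PubPer VALUES (1, 1), (2, 2), (3, 2);
    """
)

PLAIN = """
    SELECT P1.* FROM Publications P1, PubPer P2, Persons P3
    WHERE P1.pubID = P2.pubID AND P2.perID = P3.perID
      AND P3.author = ? AND P1.title LIKE '%XML%'
"""

ALIAS = """
    SELECT P1.* FROM Publications P1, PubPer P2, Persons P3, Persons P4
    WHERE P1.pubID = P2.pubID AND P2.perID = P4.perID
      AND P3.author = ? AND P1.title LIKE '%XML%'
      AND P3.authorAlias = P4.author
"""


def search(name, use_alias):
    """Return XML publications for name, optionally merged with its alias."""
    if use_alias:
        rows = db.execute(PLAIN + " UNION " + ALIAS, (name, name))
    else:
        rows = db.execute(PLAIN, (name,))
    return sorted(rows.fetchall())
```

With these rows, the plain query finds only the publication stored under the current name, while the union also returns the XML publication stored under the alias.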

Depending on the three core elements, and along the time dimension, there are various
strategies for how DLs can react to such a search. Suppose conferences C1 to C8
have evolved as follows: C1 -> C2 (i.e., name change), C3 -> {C4, C5} (i.e., conference
split), and {C6, C7} -> C8 (i.e., conference merge). Three possible strategies for
searching conferences after the name evolutions are illustrated in Table 2. Both the
“backward” and “forward” schemes are temporal strategies, where the system
searches related conferences backward or forward along the temporal dimension.
For instance, in the backward strategy, when a user searches for C2, the system shows
C2 as well as all its predecessors, here C1, as answers. Another possible strategy is
the “semantic” search, where all semantically related results are returned, regardless
of the temporal aspect. That is, the semantic strategy is equal to the union of the
backward and forward strategies. For instance, since C3 is split into C4 and C5,
whenever C3 is browsed, the search is expanded to all conferences related to C3, that
is, C4 and C5. Note, however, that browsing C4 is not expanded to C5.
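The three strategies can be stated compactly in code. The following is a minimal sketch, assuming that each evolution event is stored as a (predecessors, successors) pair and that only one evolution step is followed; the EVENTS list and the function names are illustrative choices, not part of the described system.

```python
# Evolution events of the running example, as (predecessors, successors).
EVENTS = [
    ({"C1"}, {"C2"}),        # name change
    ({"C3"}, {"C4", "C5"}),  # split
    ({"C6", "C7"}, {"C8"}),  # merge
]


def backward(name):
    """The searched conference plus its direct predecessors."""
    result = {name}
    for old, new in EVENTS:
        if name in new:
            result |= old
    return result


def forward(name):
    """The searched conference plus its direct successors."""
    result = {name}
    for old, new in EVENTS:
        if name in old:
            result |= new
    return result


def semantic(name):
    """Union of the backward and forward results."""
    return backward(name) | forward(name)
```

For example, semantic("C4") yields C3 and C4 but not C5, because C5 is neither a predecessor nor a successor of C4.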

Table 2. Various searching strategies when name authority control is considered. 



Search    (a) Backward    (b) Forward    (c) Semantic
C1        C1              C1, C2         C1, C2
C2        C1, C2          C2             C1, C2
C3        C3              C3, C4, C5     C3, C4, C5
C4        C3, C4          C4             C3, C4
C5        C3, C5          C5             C3, C5
C6        C6              C6, C8         C6, C8
C7        C7              C7, C8         C7, C8
C8        C6, C7, C8      C8             C6, C7, C8

4.3 Taxonomy of Name Authority Control 

By combining the three core elements as basic building blocks, one can cover various
real patterns found in most DLs. Suppose two elements can be concatenated by
the concatenation operator * (e.g., split * merge). We have analyzed DBLP thoroughly
and uncovered various cases where concatenations of two elements need to be
supported. Table 3 shows the taxonomy, where (1) different letters (A, B, ...)
represent different name strings, (2) different subscripts (A1, A2, ...) on the same
letter represent name variants, and (3) bold fonts represent the active authoritative
names (i.e., not aliases).



Table 3. Taxonomy of name authority control for three core elements. 



Concatenation (*)   Linear                Split                           Merge
Linear              Case 1:               Case 2:                         Case 3:
                    A -> B -> C           A -> B -> {B1, B2}              A -> B
                                                                          {B, C} -> B
Split               Case 4:               Case 5:                         Case 6:
                    A -> {A1, A2}         A -> {A1, A2} -> {A1a, A1b}     A -> {A1, A2}
                    A1 -> B                                               {A1, B} -> A1
Merge               Case 7:               Case 8:                         Case 9:
                    {A, B} -> A -> C      {A, B} -> A -> {A1, A2}         {A, B} -> A
                                                                          {A, C} -> A

Case 1 (Linear * Linear). An author happens to change his name twice over time.
After applying the Linear solution twice, the name A is changed to C.

Case 2 (Linear * Split). An author changes his name from A to B, but the same
name B already exists, which mixes the publications of the two together. To avoid
the mixture, split is applied to B, yielding two name variants B1 and B2.

Case 3 (Linear * Merge). At time t0, an author used A as the name, but later, at time
t1, he changed A to B. However, due to a data-entry error, C was also used to refer to
B. To fix this, merge is applied to B and C, making B the authoritative name.

Case 4 (Split * Linear). An author A1 wants to change his name to B, but currently his
records are mixed with those of another name, A2.

Case 5 (Split * Split). When three authors’ publications are mixed under the one
name A, two split operations are applied in a row.

Case 6 (Split * Merge). An author A1 found that his publication records are mixed
with those of another name, A2. He also found that B is his alias. Therefore, he wants
to separate his records from A2 first, and then consolidate them with B’s records.
Furthermore, he would like to use A1 as the authoritative name.

Case 7 (Merge * Linear). An author A wants to merge the publications registered under
his name variant B into his own. He also wants to change his name to C. Since the
names A and B cannot be changed to a completely different name C at once by using
‘Merge my profiles’, they are first merged into the name A. Then, the name A can be
changed to the name C.

Case 8 (Merge * Split). At time t0, two conference names, A and B, are merged into
A, but later, at time t1, A is changed to C.

Case 9 (Merge * Merge). When an author has many name variants (e.g., “Jeffrey 
Ullman,” “J. Ullman,” and “Ullman, Jeffrey D.”), multiple consecutive merge opera- 
tions can be applied. 




5 System Implementation 

5.1 Overview of the OpenDBLP 

Since the proposed solution is implemented in a test-bed called OpenDBLP, we give a
brief overview of it in this section, as shown in Figure 3. OpenDBLP is a rejuvenated
version of the popular DBLP digital library with a few novel improvements: (1) a
fully DBMS-based storage system, supporting ranked and approximate query
processing, (2) a web-service-based programmable interface (box in Fig. 3) to the
contents of DBLP, and (3) a web client program that faithfully mimics the original
FORM interface of DBLP. In particular, this program fully implements the old
“Browse” and “Search” interfaces using only web services.

The prototype system is accessible at http://opendblp.psu.edu/. 



5.2 UPDATE Function in OpenDBLP 

Here, we briefly show how the three core elements are implemented in OpenDBLP.
Users can update their records (after manually or automatically finding some semantic
irregularities) using one of the three menus shown in Fig. 4.




Fig. 4. OpenDBLP UPDATE menus. 



Figures 5-7 demonstrate (1) the “Alon Levy” case of linear change, (2) “Wei 
Wang” case of split, and (3) “Lee Coraor” case of merge, respectively. 




Fig. 5. Changing a name for “Alon Levy”: before (left) and after (right). 

Fig. 6. Creating a new profile for “Wei Wang”: before (left) and after (right).



For the name change of “Alon Levy” shown in Fig. 5, the person name in the Persons
table is changed to the new name and the old name is stored as an alias. Note that,
for historical purposes, it is important to keep the old records of “Alon Levy” even
though his current name is “Alon Halevy”: some users may ask queries specifically
using his old name.

To split the bibliographic information of “Wei Wang” at HKUST, OpenDBLP
creates a new record in the Persons table and assigns it a new perID. Then, all the
relevant publications (checked by users through the web interface) are moved to the
newly generated perID. In the end, there are two “Wei Wang”s in the system, each kept
separately and recognized by the system as different despite their identical spelling.




Fig. 7. Merging two records for “Lee Coraor”: before (left) and after (right). 



For the third example, “Lee Coraor”, OpenDBLP chooses one of the names as his
current name (i.e., the authoritative one) and stores the other name as an alias. The
perID of all his publications in the PubPer table is rewritten to the perID of the
chosen name.

5.3 SEARCH Function in OpenDBLP 

Once updates are successfully made, OpenDBLP can exploit its knowledge of name
authority control in searching, using one of the strategies in Section 4.2. Figures
8-10 illustrate the improved search in OpenDBLP, which can automatically show, for
instance, current as well as “old-but-relevant” authors or publication venues.

Fig. 8. Search result for “Lee Coraor” after update: before (left) and after (right).



Fig. 9. Search result for “Levy” after update. “Alon Y. Halevy” is returned together
with “Alon Y. Levy” and is listed as “Alon Y Halevy (a.k.a. Alon Y. Levy)”.

6 Conclusion

In this paper, we revisited the name authority control problem in Digital Libraries
(DLs) and presented a solution that, unlike previous approaches, focuses on the
system-support issues of how DLs can effectively handle the problem in (1)
updating records that have a name authority control problem, and (2) searching related
records by exploiting the knowledge about name authority control. We have identified
three core elements that underlie most name-related changes: (1) linear change of an
entity from A to B; (2) split of an entity A into multiple entities; and (3) merge of
multiple entities into a single one. Using different combinations of these core
elements, we have shown that many name authority problems can be expressed and thus
solved. Finally, all the proposed solutions are fully implemented in a test-bed, the
OpenDBLP system, so that two of the common functions of DLs, Update and Search, can
track down the right bibliographic entities despite the usage of different values.
Although our proposed solution was tested only on a particular domain (i.e., DBLP),
the techniques that we have developed can be applied to other DLs in a
straightforward manner.

Abstract. Name Search is an important search function in Digital Library systems
and various types of information retrieval systems, such as directory
search systems, electronic phonebooks and yellow pages. The paper discusses 
two main approaches to fuzzy name matching - the natural language processing 
(NLP) approach and the information retrieval (IR) approach - and proposes a 
hybrid approach. Person names can be considered a (sub-)language, in which 
case a name search system will be developed using Natural Language Process- 
ing apparatus including dictionary, thesaurus and grammatical schema. On the 
other hand, if names are perceived as (free) text, then an entirely different sys- 
tem may be built incorporating indexing, retrieving, relevance ranking and 
other Information Retrieval techniques. These two schools of thought, NLP and 
IR, have somewhat different sets of techniques originating from different theo- 
retical concerns and research traditions. A selective combination of their com- 
plementary features is likely to be more effective for fuzzy name matching. 
Two principles, position attribute identity (PAI) and position transition
likelihood (PTL), are proposed to incorporate aspects of both approaches. The two
principles have been implemented in an NLP- and IR-hybrid model system 
called Friendly Name Search (FNS) for real world applications in multilingual 
directory searches on the Singapore Yellowpages website. 



1 Introduction 

Name Search is an important search function in digital libraries and various types of
information retrieval systems, such as online library catalogs, bibliographic retrieval
systems, and yellow pages. A person’s name can exhibit many variations and forms in
published documents, and users searching for a name may enter a variant form not
found in the documents and text, or not matching the form indexed in the system. Yet
few systems offer fuzzy name matching to help users retrieve records with variant
person names. For example, a user issuing an author name search as "Lee Kuanyew"
is likely to miss a record indexed as "Lee, Harry Kuan Yew", although both refer to
the same person, the first Prime Minister of Singapore. Similar instances involving
slight misspelling or misordering in the queries will result in fruitless name
searches.

R. Heery and L. Lyon (Eds.): ECDL 2004, LNCS 3232, pp. 145-156, 2004. 

© Springer-Verlag Berlin Heidelberg 2004 

Name searches fail not only because of errors in the users’ search queries, but also
because names have numerous acceptable variants. For instance, the name "Harry
Kuan Yew Lee" is a Chinese name with native and adopted name tokens, where
"Lee" is the surname, "Kuan Yew" the given name, and "Harry" the adopted name.
Similarly, "Nurdini Abu Bakar Aljunied" is a Malay name with an adopted (Arabic)
surname, "Aljunied." To illustrate this, different forms of a name are listed in Table 1
according to five types of variation: Name Alternation (NMA), Sound-alike (SAL),
Abbreviation (ABB), Short-hand (SHH), and Contraction (CON). To further quantify
the fuzziness of the name, the number of variants is also included in the table.



Table 1. Possible Variations of Person Names 



Type of Variation        Name Instances                       Estimated number of variants

Name Alternation (NMA)   Harry Kuan Yew Lee or                $(N-1)!$, where N is the number of name
                         Kuan Yew Harry Lee;                  tokens (= 6 when N = 4)
                         Nurdini Abu Baker Aljunied or
                         Aljunied, Nurdini Abu Baker

Sound-alike (SAL)        Harry Kuon You Lee or                $\prod_{i=1}^{N} |W_i| \cdot (N-1)!$, where $|W_i|$
                         Harrie Kuan Yew Lee;                 is the number of characters in the name
                         Noordini Abu Baker Aljunie or        token $W_i$ ($\approx 4^4 \cdot 6 = 1536$, assuming
                         Nurdini Abu Baker Aljuneid           N = 4 and $|W_i| = 4$)

Abbreviation (ABB)       Harry K Y Lee or                     $(2^{N-1} - 1) \cdot (N-1)!$
                         H. Kuan Yew Lee;                     (= 42 when N = 4)
                         Nurdini A. B. Aljunied or
                         Nurdini Abu Baker A.

Short-hand (SHH)         Hari Kuan Yew Lee or                 Similar to SAL
                         Har Kuang Yew Lee;                   ($\approx 1536$, assuming $|W_i| = 4$)
                         Dini Abu Barker Aljunied or
                         Nur Abu Baker Aljunied

Contraction (CON)        Harry Kuanyew Lee;                   Similar to NMA
                         Nurdini Abubaker Aljunied            (= 6 when N = 4)
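The numeric examples in Table 1 can be checked with a few lines of code. This is a sketch under assumptions: the NMA count (N-1)! is stated in the table, while the closed forms used here for SAL (the product of the token lengths times (N-1)!) and ABB ((2^(N-1) - 1) times (N-1)!) are inferred from the worked values 1536 and 42 and may not be the exact expressions intended in the table. The function names are illustrative.

```python
from math import factorial, prod


def nma_variants(n):
    """Name Alternation: (N-1)! orderings for N name tokens."""
    return factorial(n - 1)


def sal_variants(token_lengths):
    """Sound-alike (assumed form): product of token lengths times (N-1)!."""
    return prod(token_lengths) * factorial(len(token_lengths) - 1)


def abb_variants(n):
    """Abbreviation (assumed form): (2^(N-1) - 1) * (N-1)!."""
    return (2 ** (n - 1) - 1) * factorial(n - 1)
```

For a four-token name with four characters per token, these give 6, 1536 and 42 respectively, matching the values quoted in the table.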



Many Natural Language Processing (NLP) and Information Retrieval (IR) techniques 
have been applied in search systems that can handle the complexity demonstrated in 
Table 1 (e.g., Keen, E.M., 1992). Fundamentally, they are motivated by distinct theo- 
ries of the nature of names, namely: 

Name-as-Language (NAL): Person names follow a conventional style of writing - a 
kind of grammar. More specifically, names can be parsed into components such as 
surname, given names and other limited number of name attributes. 

Name-as-Text (NAT): Person names are a text consisting of tokens as indexing 
features. More specifically, methods such as relevancy ranking and query expan- 
sion can be applied to rank name search results and accommodate name variants. 

In Section 2, we discuss the Name-As-Language (NAL) view through a review of
NLP techniques applicable to automatic name search systems. The principle of
position-attribute-identity is proposed to reflect the Name-As-Language perspective. In
Section 3, the Name-As-Text view and IR techniques are discussed, and the principle
of position-transition-likelihood is proposed to reflect the Name-As-Text perspective.
Some name search systems are reviewed in Section 4, and a hybrid model of an
automatic name search system called Friendly Name Search is introduced in
Section 5.



2 Name-as-Language (NAL) - The Deep Structure of Names 

Adopting the generative grammar perspective in linguistics, we assume that names, 
like sentences and other text units, can be generated in a rule-governed fashion by a 
name grammar. A relevant question is: 

What is the “deep structure” of a name, in terms of the “form” and “sound,” that 
remains constant throughout its conventional variants? 

Technologists with this view would build name search systems based on the
normalization of names, aiming at recovering the deep structure that is constant across
its variants. After normalization, the search process becomes a relatively
straightforward template matching of attributes and values. However, the challenge is
to compute the deep structure, which requires a native speaker's language intuition.
For example, to recognize names such as "Harry Kuan Yew Lee" (Chinese), "James Hla
Gyaw" (Burmese), and "Savar Sankaran Narasimhalu" (Indian), the attribute-value
templates that need to be computed are as shown in Table 2.



Table 2. Chinese, Burmese and Indian Name Structures 



Name Attributes            Harry Kuan Yew Lee    James Hla Gyaw    Savar Sankaran Narasimhalu
Surname (SN)               Lee                   -                 -
Native Given Names (GN)    Kuan Yew              Hla Gyaw          Narasimhalu
Acquired Name (AN)         Harry                 James             -
Father’s Name (FN)         -                     -                 Sankaran
Place Name (PN)            -                     -                 Savar



As demonstrated in Table 1, the surface form of the names has many variants. As a 
result, a regular grammar with disjunctive operators is needed to describe all potential 
forms of the names. A partial grammar for Indian names is illustrated in Table 3. 



Table 3. Indian Name Structures 



IN Type 1    GN [FN] SN        Vimol Goel (Hindi names)
IN Type 2    [FN | PN] + GN    Savar Sankaran Narasimhalu (e.g., Tamil)

The deep phonetic structure of a name token is simply its phonemes. However, name
transliteration increases the complexity of the problem. For example, there is likely
only one spelling in Arabic script for a name like "Sulayman." However, there
are many Sound-alike (Spell-alike) versions in romanized forms, such as "Suliman",
"Seleiman", and "Solomon". The same phenomenon is observed in Chinese names,
where "Zhi," in standard Hanyu Pinyin, can be spelt as "Jih", "Jyh", "Ji", "Chi",
"Chih" and so on. This is shown in Table 4.



Table 4. Arabic and Chinese Name Structure (Sound) 



Surface Names                    Phonetic Transcription
Sulayman, Suleiman               s.u.l.ey.m.ax.n
Salayman, Seleiman, Sylayman     s.ax.l.ey.m.ax.n
Suliman                          s.u.l.ih.m.ax.n
Solomon                          s.ao.l.ao.m.ao.n
Zhi, Jih, Jyh, Ji                zh.i
Chih, Chi                        ch.i

Note: The transcription is based on the DECtalk system described in (Conroy et al., 1992).



2.1 Challenges to Name-as-Language-Based Name Search Systems 

The grammatical formulation of person names has the advantage of being precise. 
The disadvantage is that it needs to be exhaustive. One such example is the Anapron 
system developed by Golding (1991) to pronounce names of different ethnic origins. 
The system contains 90 language identification rules, 205 morpheme rules, 619 tran- 
scription rules, and several hundred rules on syllable and stress structure assignments. 
Even with all this effort, the system only covers the sound variants. More rules will 
be needed to cover the form variants shown in Table 1. 

An alternative to the grammar-rule-based approach is Hidden Markov Model-based
grammatical tagging (Church, 1988; DeRose, 1988). The task of assigning name
attributes to name tokens can be seen as similar to that of assigning grammatical
tags to words. This empirical approach requires a tagged corpus to develop the
model. Such a corpus is easier to construct than dictionaries and grammatical rules
for names. Another advantage is that this approach is nondeterministic, i.e., more
than one plausible processing result can be computed, which allows further spelling
disambiguation to be applied in the case of uncertainty. In fact, one such real-world
system for normalizing other types of phrasal units, such as addresses, has been
developed using the same approach as grammatical tagging (Wang and Chuah, 1994).

Similarly, the sound aspect of names can equally be addressed through an empiri- 
cal NLP perspective. For example, Bosch and Daelemans (1993) described a data- 
oriented method for grapheme-to-phoneme conversion, whereby a statistical measure, 
Information Gain, is applied to induce rules for transcribing Dutch words. 

2.2 Position-Attribute-Identity (PAI) Principle

Whether it is rule-based or empirical, an NLP-based name search system requires
much manual effort and time to build the dictionary, thesaurus, schema/grammar rules,
and linguistic tagging resources, which will most likely render an NLP-based approach
infeasible.

We propose a resource-economical approach called the position-attribute-identity
principle, which is simply:

Name positions are literally taken as attributes for the name structure.

With positions as the attributes/tags, and using a dynamic programming-like
constraint checking process, the most plausible positions for each name token are
identified and used as the basis for ranking and retrieving actual name records. More
details are given in Section 5 of the paper.

Since the positional information of a name token can be readily accessed from a 
name database, no additional manual tagging is necessary during this process. That 
the position-attribute-identity principle is plausible can be seen from the fact that in a 
fully normalized name database, controlled by cataloging rules, the position corre- 
sponds exactly to the attribute of the name tokens. For instance, the first token for 
both Chinese and English names is the surname. 

Taking the surface structure literally as the deep structure reduces resource over- 
head. This approach can also be applied to the sound aspect of names. In fact, Bosch 
and Daelemans (1993) used a straightforward table look-up method to transcribe the 
sounds of a Dutch name, outperforming statistical approaches such as those based on 
Information Gain. 

In summary, what is gained by examining the Name-As-Language view of name
search systems is the following:

1. Recovering the deep structures of the form and sound aspects of names makes
name search systems more effective. But the resources required by the normalization
process can be a serious constraint.

2. To overcome the resource overhead, it is necessary to take readily available
information from the name data as the basis for language modeling. The
position-attribute-identity principle is proposed for this purpose.


3 Name-as-Text (NAT) - The Feature and Similarity of Names 

From a Name-As-Text perspective, names are seen as sets of characteristic features.
For example, in a name database consisting of "Harry Kuan Yew Lee", "Harry Hui
Kuan Deng", "Jack Yew Hui Lee" and "Peter Wu," nine distinct features are
identified: (1) "Harry," (2) "Kuan," (3) "Lee," (4) "Hui," (5) "Deng," (6) "Jack," (7) "Yew,"
(8) "Peter," (9) "Wu," as shown in Table 5.

The order of the name tokens is significant for distinguishing names. However, a
name can have more than one correct order, as illustrated earlier in Table 1. A name
like Harry Lee Kuan Yew, represented as <1, 3, 2, 7>, is the same as Harry Kuan
Yew Lee (<1, 2, 7, 3>), while Harry Yew Kuan Lee (<1, 7, 2, 3>) is a different name.
Measuring the similarity between different name orders precisely is a challenge.

Table 5. Name Records and Feature Representations

Name Records            Feature Representation
Harry Kuan Yew Lee      <1, 2, 7, 3>
Harry Hui Kuan Deng     <1, 4, 2, 5>
Jack Yew Hui Lee        <6, 7, 4, 3>
Peter Wu                <8, 9>
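The feature representation of Table 5 is easy to reproduce. The sketch below hard-codes the nine feature numbers given in the text (they are not assigned in order of first appearance) and adds a hypothetical helper, same_token_set, to illustrate why token order matters: a pure set-of-features comparison cannot tell a valid reordering from a different name.

```python
# Feature numbers as listed in the text for the four-record example database.
FEATURES = {
    "Harry": 1, "Kuan": 2, "Lee": 3, "Hui": 4, "Deng": 5,
    "Jack": 6, "Yew": 7, "Peter": 8, "Wu": 9,
}


def represent(name):
    """Map a name to its ordered tuple of feature numbers."""
    return tuple(FEATURES[token] for token in name.split())


def same_token_set(a, b):
    """True when two names use exactly the same features, ignoring order."""
    return sorted(represent(a)) == sorted(represent(b))
```

Both "Harry Lee Kuan Yew" (the same person) and "Harry Yew Kuan Lee" (a different name) have the same token set as "Harry Kuan Yew Lee", so an order-blind comparison treats all three alike.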



The treatment of "Sound-alike" variants of names from the NAT viewpoint contrasts
with the IR research focus on "Meaning-alike" terms, where queries are expanded
with synonymous terms using a thesaurus. In IR, if the system can judge the
similarity between terms accurately, then "Meaning-alike" expansion with synonymous
terms will increase the retrieval effectiveness (Qiu and Frei, 1993). However, this
same principle cannot be applied literally to "Sound-alike" expansion.



3.1 Challenges in Name-as-Text-Based Name Search Systems 

Most IR approaches rank relevance based on term frequency (tf) and inverse
document frequency (IDF). The less frequently a particular token i appears across the
collection (i.e., the lower the document frequency, and the higher the inverse
document frequency, IDF), the more characteristic the token is of the name record.
Similarly, the more frequently a token occurs in a name record (term frequency), the
more characteristic the token is of the name. However, IDF and tf overlook the attribute
aspect of the token treated in the Name-As-Language view. For example, in the name
"Robert Kong Kong Tan," "Kong" can be a surname ("Robert Kong Kong Tan") or a given
name ("Robert Kong Kong Tan"), in which case the surname is "Tan." If the target
record is "Robert Kong Kong Tan" (with "Kong" as the surname), then a retrieved record
"Robert Kong Kong Tan" (surname "Tan") is actually not as relevant a record as
"Robert Kew Chong Kong" (surname "Kong").

Furthermore, phrase and proximity operators, such as adjacency, window size and 
directed window, widely used in traditional IR systems fail to tackle finer grained 
relevancy ranking among positional variations because these operators are Boolean 
operators giving binary relevance results: either satisfied or failed (Keen, 1992). 

There are essentially two approaches to ranking the relevance of records retrieved: 

1. the expanded tokens can be combined with the original tokens, upon which the 
accumulation of postings is done for each token in the combined query. We refer 
to this as the +-combination. 

2. postings in each element of the Cartesian product of the expanded tokens can be
accumulated. We refer to this as the x-combination.

Most query expansion techniques adopt the +-combination type (e.g., Qiu & Frei,
1993). However, in the case of the "Sound-alike" type of name variation, the distinction
in dimensionality cannot be maintained by a simple +-combination. For example,
given the expansion groups {"Kon", "Kong", "Khon"} and {"Yan", "Yen", "Yang"}
and the +-combination, a query string "Kon Yang Kong," targeting the name "Kon
Yang Khon", can be expanded into {"Kon", "Kong", "Khon", "Yan", "Yen", "Yang",
"Kon", "Kong", "Khon"}. Positional information is not maintained. Thus, a name
such as "Kon/SN Yang/GN Chee/GN"¹ (not the targeted name) will be retrieved with
the same relevancy score as "Kon/SN Yang/GN Khon/GN" (the targeted name), both
getting 4/9 from Dice's coefficient². This is because the dimensionalities of "Kon", a
surname, and "Khon", a given name, are collapsed into an indistinguishable group in
the +-combination process.

In order to avoid the drawbacks described above, the x-combination of query
expansion has to be adopted. However, computationally, the x-combination leads to
combinatorial explosion. For example, if a name has m tokens <A_1, ..., A_m>, and each
A_i yields |A_i| tokens after query expansion, the commonly used +-combination will
result in $\sum_{i=1}^{m} |A_i|$ expanded query tokens, whereas the x-combination will
yield $\prod_{i=1}^{m} |A_i|$ expanded queries. How to overcome this serious
constraint is the topic of the next section.

3.2 Position-Transition-Likelihood (PTL) Principle 

To restrict the combinatorial expansion in the x-combination query expansion, one
approach is to incorporate a filtering mechanism while maintaining the same
dimensionality. We propose the position-transition-likelihood principle:

The likelihood of a transition between a pair of name tokens, in terms of their
positions, is used to filter the expanded queries in the x-combination.

Using this principle, out of the 27 results from the x-combination of the expanded
"Kon Yang Kong," only one (the correct one) is plausible: "Kon Yan Khon," whose
position transitions, Kon_pos1 -> Yan_pos2 and Yan_pos2 -> Khon_pos3, were found to be
highly probable.
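One simple way to realize such a filter is to count position-specific token transitions in the name database and keep only candidates whose transitions have all been observed. This is a sketch under assumptions, not the method of the described system: the three-record NAME_DB is hypothetical, and the likelihood is approximated by a raw count with a threshold (min_count), since the text does not specify how the likelihood is estimated.

```python
from collections import Counter
from itertools import product

# Hypothetical name records standing in for a real name database.
NAME_DB = ["Kon Yan Khon", "Kong Yen Chee", "Tan Yang Kong"]


def transition_counts(names):
    """Count (position, token, next_token) transitions over all records."""
    counts = Counter()
    for name in names:
        tokens = name.split()
        for pos in range(len(tokens) - 1):
            counts[(pos, tokens[pos], tokens[pos + 1])] += 1
    return counts


def plausible(tokens, counts, min_count=1):
    """A candidate is kept only if every transition was seen often enough."""
    return all(
        counts[(pos, tokens[pos], tokens[pos + 1])] >= min_count
        for pos in range(len(tokens) - 1)
    )


def filter_candidates(groups, counts):
    """Filter the x-combination of the expansion groups by transition counts."""
    return [" ".join(c) for c in product(*groups) if plausible(c, counts)]
```

With the expansion groups of "Kon Yang Kong" and this toy database, only "Kon Yan Khon" survives out of the 27 candidates. A real system would presumably use smoothed probabilities rather than a hard count threshold.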

In summary, our analysis of the Name-As-Text view results in the following: 

1. Traditional measures of similarity and relevancy in IR are not sufficient for
automatic name search systems, although the lead time to an operational Name- 
As-Text-based system can be shorter than a Name-As-Language-based one. 

2. Overcoming the combinatorial explosion for more precise retrieval is crucial. As 
such, the position-transition-likelihood principle is proposed for filtering out 
unlikely combinations. 



¹ As shown in Section 2, SN stands for Surname and GN for Given Name.
² Dice's coefficient is defined as $2(|Q \cap D|)/(|Q| + |D|)$.

4 Previous Work on Name Search

Systems which deal directly with the task of automatic name search include the
Synoname system (Siegfried and Bernstein, 1991; Borgman and Siegfried, 1992),
developed by a team under the Getty Art History Information program to archive art
works by around 6,000 artists. When museums exchange cataloging information without
a proper name matching procedure, artworks by the same artist may be cataloged under
different names. The system's engine for name matching includes 12 comparison
techniques: (1) Exact match, (2) Omission of one character, (3) Substitution, (4)
Transposition, (5) Difference in punctuation, (6) Initials, (7) Extended name, (8)
Inclusion of names within names, (9) No first name, (10) Word approximation, (11)
Confusion of dividing names, and (12) Character approximation.

The first 4 techniques concern fuzzy string matching within a Levenshtein distance
of 1 (Hall and Dowling, 1980). Techniques (7), (8), (9) and (11) are easily handled
by an IR system, since they just mean that the set of features in the query string is a
subset of those of the name record. Technique (10) presents a challenge, which can be
covered under our considerations of Sound-alike (SAL) variants. Technique (6) is
exactly the same as those treated in Abbreviation-type (ABB) variants. This leaves only
technique (12), which is character approximation. For example, "Backhuyzen, Ludolf" is
related to "Bakhuysen, Ludolf," and "D'Espagnat, George" to "Espagnat, George d'".
These examples show that even strings a Levenshtein distance of two apart still
need to be regarded as a match. Thus, technique (12) is just a generalization of the
techniques which handle cases of only Levenshtein distance 1. In general, with the
capability for handling NMA-, SAL-, ABB-, SHH-, and CON-types of name variation,
our approach can have the same flexibility in name matching as Synoname.
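For reference, the Levenshtein distance mentioned above can be computed with the standard dynamic-programming recurrence. This is a generic textbook sketch, not code from Synoname or from the system described here.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,                       # delete char_a
                    current[j - 1] + 1,                    # insert char_b
                    previous[j - 1] + (char_a != char_b),  # substitute or match
                )
            )
        previous = current
    return previous[-1]
```

Applied to the surnames of the first example pair, levenshtein("Backhuyzen", "Bakhuysen") is 2: one deletion ("c") and one substitution ("z" to "s").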

Another interesting study on name search is the Ph.D. research of Hermansen
(1985), who investigated the "New York State Identification and Intelligence
System." The two important aspects of names, form and sound (identified as "name
structure" and "transliteration" in the thesis), were argued to be crucial for automatic
name search systems. However, no system implementation was involved in the study.
Also, as it was a rather early work, no reference to modern NLP or IR techniques
was made. On the other hand, many aspects of the ad hoc name search algorithms
were examined, providing a good review of the technologies available up to the
mid-1980s. These ad hoc techniques are: sound-based similarity (Moore, 1965; Roughton
and Tyckoson, 1985), n-gram entropy (Fokker and Lynch, 1974; Fokker, 1974),
name subsetting (Taft, 1970), and record linkage (Moore, 1965). Pfeifer et al. (1996)
performed experiments measuring the retrieval effectiveness of various proper name
search methods. They argue that phonetic similarity (PHONDEX) works as well as
typing-error similarity (Damerau-Levenshtein metric) and plain string similarity
(n-grams), and that combinations of these techniques perform much better than any
single technique.

Bell and Sethi (2001) discussed potential matching algorithms for patient
identification resolution for use with a massively distributed Master Patient Index, a
facility to make all patients’ medical records in the U.S. accessible to care providers.
Patient identification resolution considers attributes in addition to
name, such as address, telephone, social security number, and date of birth. Name
searching is also important in the fields of machine translation and cross-lingual
information retrieval. Stalls and Knight (1998) and Virga and Khudanpur (2003)
worked on translating names and technical terms using phonetic translation (e.g.,
from English to Mandarin). Pirkola et al. (2003) investigated fuzzy cross-lingual
translation of proper names and technical terms, but no phonetic elements were
included in the techniques.






In summary, existing name searching systems mostly use ad hoc techniques
originating from disparate fields of study, such as fuzzy string matching, rule-based
pattern matching, record linkage, and soundex schemes. In contrast, our system, the
Friendly Name Search system, adopts a theory-driven alternative for automatic name
search system development. However, it is acknowledged that the form and sound
aspects of names across the world are still largely an ad hoc phenomenon. Thus, in an
operational environment, ad hoc methods may still be required to address certain
peculiarities.



5 Friendly Name Search (FNS) - Towards a Theory-Driven Automatic Name Search System

Human Logic iSearch is a name search solution from Mustard Technology. 3 Its core
technology is Friendly Name Search (FNS) 4, aspects of which are patented by
Kent Ridge Digital Lab, of which Mustard Technology was a spin-off. The
architecture of the current FNS system is shown in Fig. 1. Each database name goes
through a tokenization process before being indexed. During the process, a
domain-specific name thesaurus and metadata are incorporated in the name modeler to
produce information based on the Position-Attribute-Identity (PAI) and
Position-Transition-Likelihood (PTL) principles. A Fuzzy Name Index is produced at
the end of the indexing process.

Query names are transformed and processed in the same way as the names in the
database; they are then matched, scored and ranked against the Fuzzy Name Index
to produce the search results.

Both the PAI (position-attribute-identity) and PTL (position-transition-likelihood)
principles concern the overall collection, instead of individual postings. Fig. 2
demonstrates the case where three query tokens, A, B and C, are expanded. The first
token is expanded into 4 tokens, A1 to A4, and similarly for B and C. Each expanded
token has frequency counts at different positions. For example, the first expanded
token A1 has frequency counts represented by A1.1 and A1.4, for positions 1 and 4,
while A2 has A2.1, A2.2 and A2.4, for positions 1, 2, and 4. (In Fig. 2, these are
sorted in position order.) Thus, a potential result from the cross-combination of the
expansion is {A1.1, B1.2, C1.3}; on the other hand, {A2.1, B3.1, C1.1} is illegal, since
in this case all of A2, B3 and C1 are in position 1 of a name, which is not allowed.
Thus it is pruned away from the final result.

For each legal result {$x_i$, $i = 1$ to $m$}, where $x_i$ is the name token at position $i$, the
score is calculated as:

\[ \sum_{i=1}^{m-1} \bigl[ (\mathrm{freq}(x_i) + \mathrm{freq}(x_{i+1})) \times \mathrm{freq}(x_i, x_{i+1}) \bigr] \qquad (1) \]



3 Mustard Technology’s website URL: http://www.mustardtechnology.com 

4 “A System of Organizing Catalog Data for Searching and Retrieval” Patent No: US 
6,381,607 B1 on 30 Apr 2002. 
Fig. 1. Flow Diagram of Friendly Name Search (panels: Database Indexing Process; Query Process)



Fig. 2. Illustration of PAI and PTL principles in Name Modeler (pruned combinations are labeled *illegal*)

The rationale is that the more frequently a position is associated with a token, and a
position transition with a token pair, the more likely a result containing those tokens
and token pairs is to be relevant. The similarity is thus proportional to
the frequencies of the occurrences of such tokens and pairs in the name. The actual
accumulation of postings still needs to be executed to enumerate the matching
records.
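
The pruning and scoring steps can be sketched as follows. This is only an illustration under stated assumptions, not the patented FNS implementation: we assume that a combination is legal when no two tokens occupy the same position, that Eq. (1) is a sum over adjacent token pairs, and that transition frequencies are looked up by token pair. The function names and all frequency values are hypothetical.

```python
from itertools import product


def legal_results(expansions):
    """Yield position-ordered candidates from the cross-combination of
    expanded tokens, skipping combinations that reuse a position."""
    # One (token, position, frequency) option per observed position.
    options = [
        [(tok, pos, freq) for tok, by_pos in exp.items() for pos, freq in by_pos.items()]
        for exp in expansions
    ]
    for combo in product(*options):
        positions = [pos for _, pos, _ in combo]
        if len(set(positions)) == len(positions):  # assumed legality rule
            yield sorted(combo, key=lambda option: option[1])


def score(result, pair_freq):
    """Sum (freq(x_i) + freq(x_{i+1})) * freq(x_i, x_{i+1}) over adjacent tokens."""
    total = 0
    for (tok_a, _, freq_a), (tok_b, _, freq_b) in zip(result, result[1:]):
        total += (freq_a + freq_b) * pair_freq.get((tok_a, tok_b), 0)
    return total


# Hypothetical position frequencies for the expansions of A, B and C.
expansions = [
    {"A1": {1: 5, 4: 1}, "A2": {1: 2, 2: 3, 4: 1}},
    {"B1": {2: 4}, "B3": {1: 2}},
    {"C1": {1: 1, 3: 6}},
]
# Hypothetical transition frequencies for adjacent token pairs.
pair_freq = {("A1", "B1"): 3, ("B1", "C1"): 2}

results = list(legal_results(expansions))
best = max(results, key=lambda r: score(r, pair_freq))
print(len(results), [(tok, pos) for tok, pos, _ in best], score(best, pair_freq))
```

With these made-up counts, 11 of the 20 cross-combinations are pruned before any postings would be touched, and the combination A1, B1, C1 in positions 1, 2, 3 ranks first.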

Computationally, applying the principles brings a further advantage. The
process demonstrated in Fig. 2 prunes away the illegal results, whose
postings need not be accumulated, saving index access time. The reduction gained is
generally in the order of the difference between $\frac{N!}{(N-M)!\,M!}$ and
$\prod_{i=1}^{M} |A_i|$, where $N$ is the number of potential positions (usually less
than 10), $M$ is the token length of the query name, and $|A_i|$ is the size of the
$i$-th expanded token set.
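
To give a feel for the two quantities, the short sketch below evaluates them for hypothetical values, assuming they are the binomial coefficient N!/((N-M)! M!) and the product of the expanded token set sizes. The function name and the sample numbers are our own.

```python
from math import comb, prod


def candidate_counts(n_positions, expansion_sizes):
    """Return (number of ways to choose M of N positions,
    size of the full cross-combination of the expanded token sets)."""
    m = len(expansion_sizes)
    return comb(n_positions, m), prod(expansion_sizes)


print(candidate_counts(8, [4, 4, 4]))
print(candidate_counts(9, [10, 10, 10, 10]))
```

The gap between the two numbers widens quickly as the expanded token sets grow, since the first quantity depends only on N and M.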

A version of Friendly Name Search has been deployed on a website called
Singapore Yellowpages, located at http://www.yellowpages.com.sg. The site contains
millions of records and is among the most accessed websites in Singapore.



6 Conclusion 

Technologists building theory-driven name search systems are confronted with two 
seemingly different alternatives: the Name-as-Language (NAL) and the Name-as- 
Text (NAT) approaches. The NAL-based approach treats names as word sequences 
generated by rule-governed grammar. As an alternative to the NAL view, the NAT- 
based approach assumes names are just records consisting of features derivable from 
name tokens. Based on NAL, a more data-driven method, called the
Position-Attribute-Identity (PAI) principle, is proposed. The PAI principle regards name
positions as attributes in name structures. The position-transition-likelihood (PTL)
principle, which is motivated by NAT, is introduced together with the PAI principle to
prune and verify the query expansion process. Thus, a theory-driven name search system,
called Friendly Name Search (FNS), is built by combining the complementary
advantages of both the NAL and NAT approaches to achieve effectiveness both in system
development and quality of search. FNS has been applied in real-world applications
at Singapore Yellowpages and in many organizations in the public, banking and
telecommunications sectors.
Abstract. Libraries of digitized multimedia content provide access to virtual
entities. In the case of music, where there are frequently many different
performances, editions, and arrangements of a given work, the Variations2 metadata
model links all instances of a work to an abstract work record, thus yielding
superior search capabilities to digital library users. This paper summarizes
the motivation for addressing the music metadata problem and describes the
Variations2 search user interface, which is based on our work-centric, FRBR-like
metadata model.



1 Introduction 

The Variations2 Indiana University Digital Library is a large test-bed development 
and research project funded in part by Phase 2 of the Digital Libraries Initiative, with 
support from the National Science Foundation and the National Endowment for the 
Humanities [1]. This paper reports on the state of the Variations2 test-bed software, 
describing in particular the search user interface. We begin by reviewing the motiva- 
tions for attempting an improved environment for music search. Some of these moti- 
vations are common to other digital library efforts; others are specific to issues asso- 
ciated with music. We then describe our implementation of a search user interface and 
the current state of our system. 



2 Background 

Motivations for the Variations2 approach to searching come from at least two direc- 
tions. First, Variations2 shares in larger library and digital library issues associated 
with virtualization. Second, music information offers unique challenges, challenges 
which have not always been met well by existing solutions. 

2.1 Virtualization, Abstraction and New Metadata Models 

This paper springs from the junction of two simultaneous developments: library virtu- 
alization and catalog entity abstraction. 
Digital libraries provide a level of disembodiment of library materials. Digital
materials have a reduced physicality in at least three respects. First, patrons cannot pick a
digital item off a shelf and hold it in their hands. Digital library contents are less tan- 
gible. Second, the collocation of items in a collection need no longer be spatial in a 
physical sense. Hence the term virtual, while not synonymous with digital, is often
used to describe digital libraries. Third, reduced reliance on physicality also becomes 
evident as users seek content (i.e., works) rather than containers (e.g., the “red book,” 
the “CD with a picture of a dog”), influenced at least in part by the MP3 phenomenon 
where users tend to think in terms of “tracks” or individual works. Effective metadata 
for resource discovery thus becomes even more important as physical browsing is no 
longer possible in digital library environments. 

Over the last several decades, librarians have been reconsidering cataloging
models. To a large extent, this reconsideration has been driven by the development of
cooperative cataloging and the consequent need for common practices brought about by
such systems as OCLC’s WorldCat [2] and RLG’s Union Catalog [3]. Such efforts also
afford opportunity beyond mere consistency towards fundamental improvements to
the overall model. One such improvement effort is the Functional Requirements for
Bibliographic Records (FRBR) effort from the International Federation of Library
Associations and Institutions (IFLA) [4].

FRBR seeks to improve upon the existing paradigm of MAchine Readable Cata- 
loging (MARC, [5]) bibliographic and authority records, the paradigm used by coop- 
erative cataloging efforts such as OCLC’s. The MARC-based paradigm stores infor- 
mation about the physical item in a bibliographic (“bib”) record. It also has authority 
records for such information as work titles, people’s names, and subject headings 
which help ensure consistent and unique naming. However, MARC-based
implementations often provide no linking between the record types. For example, a cataloger
will find the name authority record for a book’s author but may not have any way to 
reference that authority record explicitly within the bib record or enact global changes 
across the system. Instead, the authoritative name for the author is copied separately 
into each bib record. 

In contrast to MARC, FRBR uses an entity-relationship approach to provide strong 
linking between records. For example, in FRBR, an item (e.g., a copy of a book) is an 
exemplar of a manifestation (e.g., all books with the same ISBN), which embodies an 
expression (edition) of a work (the abstract entity representing the original intellectual 
or creative content). This strong linking can be used to provide both collocation and a 
coherent disambiguation path for users. 
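
The linking described above can be sketched with a few record types. This is a minimal illustration, not the FRBR specification or any cited system: the class and function names are ours, the sample data is invented, and for brevity each manifestation here embodies exactly one expression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    title: str


@dataclass(frozen=True)
class Expression:
    work: Work          # the work this expression realizes
    description: str


@dataclass(frozen=True)
class Manifestation:
    expression: Expression  # simplification: one expression per manifestation
    identifier: str


@dataclass(frozen=True)
class Item:
    manifestation: Manifestation
    copy_label: str


def work_of(item):
    """Follow the links from a single copy up to its abstract work."""
    return item.manifestation.expression.work


def collocate(items):
    """Group items by the work they ultimately exemplify."""
    groups = {}
    for item in items:
        groups.setdefault(work_of(item), []).append(item)
    return groups


work = Work("Example Work")
first = Manifestation(Expression(work, "first edition"), "isbn-A")
second = Manifestation(Expression(work, "second edition"), "isbn-B")
items = [Item(first, "copy 1"), Item(first, "copy 2"), Item(second, "copy 1")]
groups = collocate(items)
print({w.title: len(copies) for w, copies in groups.items()})
```

Because every item reaches its work through explicit links, copies of different editions collocate under one work record without any string matching on titles.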

The FRBR specification has been used as the basis for some system development. 
FRBR-based projects include FRBR support within the VTLS Virtua system [6], the 
AustLit Australian Literature Gateway [7], RLG’s RedLightGreen [8], and OCLC
WorldCat’s Fiction Finder [9].

When the Variations2 Indiana University Digital Music Library project began 
more than three years ago, we determined to develop a metadata model that would 
support a greatly improved search interface for music [10]. The weaknesses of 
MARC-based music cataloging are well documented (see, e.g., [11]; we review them 
briefly in the next section). While not based directly on FRBR, the Variations2 meta- 
data model nonetheless bears a strong resemblance (Table 1). 

Like FRBR, our system is work-centric, being influenced by the work of both
Velucci [12] and Smiraglia [13]. We have implemented a digital music library,
Variations2, based on that metadata model, have deployed the system in our music library,
and have seen increasing usage over the past year and a half. 



Table 1. Variations2 and FRBR Compared

Variations2 Entity | Variations2 Description | FRBR Rough Equivalent
Work | abstract concept of a musical composition or set of compositions | Work
Instantiation | recorded performance of a work (audio) or edition of a work (score) | Expression (note 1)
Container | physical item or set of items within which one or more instantiations are present (e.g., a CD or CD set, a score) | Manifestation (note 2) (physical embodiment of an expression, e.g., release, edition)
Media Object | digital sound file(s) or score image(s) | Item (note 3) (an actual copy of a manifestation)
Contributor | individuals or groups related to a work, instantiation, or container (e.g., composers, performers, conductors, producers, ensembles) | Two entities: Person; Corporate Body

Notes:

1. In FRBR an expression can be manifested multiple times; in Variations2, instantiations
are unique to a container, even if two containers reflect the same performance.

2. “A manifestation may embody one or more than one expression” [7, p. 13]. The
Variations2 Container, however, is less abstract, having some amount of item-level
descriptors.

3. The FRBR item refers to a copy in a collection; the Variations2 media object is a
digitization of a container. Thus in FRBR, there are potentially many items for a
manifestation; in Variations2, there is only one digitization of a container, even if
multiple media objects are needed to capture all the container’s content.



2.2 Finding Music in a MARC-Based OPAC 

Online Public Access Catalogs (OPACs) are the primary means by which library 
users access library collections. OPACs offer searching of bibliographic records (al- 
most always) in the MARC bibliographic format, and under certain circumstances 
provide to the user a list of authorized and unauthorized (i.e., cross-referenced) 
names, titles, or subjects from MARC authority records. Despite many advances that 
have been made to OPACs since library catalogs first went online, searching for mu- 
sical materials in OPACs can still be problematic, due to both OPAC design and to 
the structure and contents of the MARC bibliographic record itself. 

Library catalog records are created by a convergence of a number of different stan- 
dards. The MARC Bibliographic format prescribes the fields, subfields, and indicators 
used to mark what type of information is being recorded. The basic descriptive infor- 
mation that is contained in the MARC record is copied from the item being cataloged 
and is formatted according to the Anglo-American Cataloguing Rules (AACR2 [14]). 
"Access points" - other descriptive information formatted in a standard way, not di- 
rectly copied from the item being cataloged - are similarly selected and formatted 
according to AACR2 rules. Subject headings are chosen from controlled lists, most 
often the Library of Congress Subject Headings (LCSH). 
The MARC format and its associated data content standards provide precision to
bibliographic data. Encoding of information in bibliographic records, for example, 
allows the distinction between works by a person and works about a person, while 
still providing for a connection to be made between them by using the same form of 
the name in both places. The catalog of MARC records provides both a descriptive 
function - reproducing exactly what is on a physical item allowing users to access 
titles or authors they've seen - and a collocating function - grouping bibliographic 
items representing the same authors, subjects, and, to some extent, works. 

One challenge to music searching is the MARC record's focus on a "static physical 
artifact" [11, p.2]. The data in a MARC record describe a bibliographic item as a 
whole, not necessarily any specific part of it. This is problematic because items held 
in a music library, especially sound recordings, often contain multiple works. Thus 
there is often no way for a user to know, for example, which of the performers listed 
in a record is connected with a given piece on the recording being described. 

The Nature of an OPAC. The OPACs in use in libraries of all sizes today are typi- 
cally one part of large Integrated Library Systems (ILSs) used for automation of many 
library services, including acquisitions, cataloging, circulation, and patron billing. 
OPACs from different vendors also have vastly different native functionalities, and 
are customizable by the library implementing them. Search and browse success in an 
OPAC relies heavily on the design and implementation decisions for an individual 
installation in addition to the nature and structure of the bibliographic data in the 
MARC records it contains. 

Indiana University’s IUCAT, based on the Sirsi Unicorn Integrated Library System 
(ILS), is a fairly typical example of a modern OPAC. Keyword searching in a large
number of fields from the MARC bibliographic record is available, as is browsing and 
searching on fields (actually groups of fields from the MARC record) labeled as au- 
thor, title, subject, series, periodical title, and medical subject. Basic Boolean opera- 
tors and term truncation are available. The OPAC performs reasonably well for sim- 
ple “known-item” bibliographic searches such as for author or title, but less well for 
more specialized queries essential to music searches, such as for discovery of pieces 
with specific instrumentation. Cataloging rules place names of instruments in multiple 
fields within the MARC record. But these fields do not use terms for instrumentation 
in a consistent manner, so a keyword search of the entire record on instrument names 
will not find all relevant records in the catalog, and will at the same time add many 
irrelevant records. One partial solution was the creation of a dedicated field in the 
MARC record for instrumentation, but this field is rarely used, due in part to the 
amount of time it takes to add this information to bibliographic records, but also 
largely due to the fact that almost no OPACs, including Indiana University’s, index 
this field for searching or display it to users. 

Collocation by Work. Collocation by work is one of the functions of cataloging 
wherein OPAC designers and consequently OPAC users often do not succeed. MARC 
and AACR2 provide basic work collocation through a mechanism called the uniform 
title. All records describing the same musical piece, whether in score or recording, 
have the same uniform title. There are also additions to music uniform titles that indi- 
cate, among other things, whether a record is for an arrangement of a musical work, a 
part of a musical work, or a musical setting of a textual work. This in theory allows 
connections to be made between multiple versions of the same work and its
variations. For example, a score or recording of Bach’s complete 2-book Well-Tempered
Clavier would have the uniform title Wohltemperierte Klavier, the Prelude and Fugue
#1 from Book 1 would have Wohltemperierte Klavier, 1. T., Nr. 1, and an arrangement
for guitar of the entire work would have Wohltemperierte Klavier, Arr.

But this work collocation function is often not readily available to the average li- 
brary catalog user. First, uniform titles are not present in all bibliographic records. 
Cataloging rules governing appropriate use and the presence of records created before 
the uniform title achieved its present form are among the reasons a uniform title may 
be missing from a given record. Second, many OPACs don't make full use of the 
uniform title for display purposes. Many catalogs provide basic grouping capability 
on the first part of the uniform title (the actual name of the work), but then fail to 
meaningfully use the other parts of the uniform title that indicate format, arrangement 
or selection, and the like. Similarly, most library systems do not use the semantic 
links between whole works and their parts that uniform titles provide [11, p.4]. Cur- 
rent OPACs on the whole do not recognize this link, and thus fail to retrieve the larger 
work when a part is searched. 
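
A rough sketch of how a catalog could use the whole uniform title is given below. It assumes, for illustration only, that the work name is the text before the first comma and that later comma-separated parts are qualifiers such as a part designation or "Arr."; real uniform titles can contain commas within the work name, so this split is a simplification. The record identifiers and function names are hypothetical.

```python
from collections import defaultdict


def split_uniform_title(uniform_title):
    """Split a uniform title into (work name, tuple of qualifiers)."""
    name, *qualifiers = [part.strip() for part in uniform_title.split(",")]
    return name, tuple(qualifiers)


def collocate_by_work(records):
    """Group (record_id, uniform_title) pairs under their work name,
    keeping the qualifiers so the display can distinguish them."""
    groups = defaultdict(list)
    for record_id, uniform_title in records:
        name, qualifiers = split_uniform_title(uniform_title)
        groups[name.casefold()].append((record_id, qualifiers))
    return dict(groups)


records = [
    ("rec-1", "Wohltemperierte Klavier"),
    ("rec-2", "Wohltemperierte Klavier, Arr."),
    ("rec-3", "Example Work, Arr."),
]
groups = collocate_by_work(records)
print(groups["wohltemperierte klavier"])
```

Grouping on the work name while retaining the qualifiers lets a search for an arrangement or a part also surface the record for the whole work, which is the link the paragraph above says most OPACs ignore.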



3 Implementation 

In this section, we describe the current (version 2.1.1) Variations2 search user inter- 
face, including the options available on each of the four tabs (basic, advanced, key- 
word, and browse). We also describe how the disambiguation process varies depend- 
ing both on what fields the user fills in and the actual content of the digital library. 

The Variations2 software is cross-platform (Windows and Macintosh), imple- 
mented as a Java application. While the search interface could have been imple- 
mented in a web browser, the other features of Variations2 (audio player, score 
viewer, etc.) would not have worked as well within a browser window, so we decided 
to implement the entire application as separate Java windows. The technical architec- 
ture of Variations2 is beyond the scope of this paper, but a description may be found 
in [15]. 

3.1 Search Tabs 

The Variations2 search window (Figure 1) is the default initial window displayed by 
the application after users log in. The search window is divided into two sections: the 
search tabs, where users specify their search criteria, and the results pane, where the 
results of the search are displayed. The results pane has a row of controls above it for 
forward/backward navigation, canceling a search in progress, or changing the display 
of the results by sorting or filtering. 

Basic Tab. The search window defaults to the Basic tab, which provides five fields 
for search criteria specification. 

— Creator/Composer (like author, but music is different) 

— Performer/Conductor (critically important for music) 

— Work Title (often different from the name of the container) 

— Key (two drop-down lists: key letter and mode, e.g., A, minor) 

— Media format (drop-down list with various types of recording and score formats) 
Fig. 1. Search Window, Basic Tab

In all of the text entry fields in the search interface, the following properties apply. 

— Case insensitivity 

— Partial words are matched by default, e.g., “beeth” will find Beethoven 

— Quotation marks permit searching for the exact word or phrase 

— Other punctuation and diacritics are ignored 
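
The four properties above can be approximated in a few lines. The sketch below is not the Variations2 code (which is a Java application); it assumes that partial matching means matching the beginning of a word and that only straight double quotes mark an exact phrase. The function names are our own.

```python
import string
import unicodedata


def normalize(text):
    """Fold case, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = "".join(" " if ch in string.punctuation else ch for ch in no_marks)
    return " ".join(no_punct.casefold().split())


def matches(query, field_value):
    """Return True if the query matches the field under the rules above."""
    value_words = normalize(field_value).split()
    query = query.strip()
    if len(query) >= 2 and query[0] == query[-1] == '"':
        # Quoted query: the exact word or phrase must appear.
        phrase = normalize(query[1:-1]).split()
        n = len(phrase)
        return n > 0 and any(
            value_words[i:i + n] == phrase for i in range(len(value_words) - n + 1)
        )
    # Default: every query term must begin some word of the field.
    terms = normalize(query).split()
    return bool(terms) and all(
        any(word.startswith(term) for word in value_words) for term in terms
    )


print(matches("beeth", "Beethoven, Ludwig van"))
print(matches('"beeth"', "Beethoven, Ludwig van"))
print(matches("DVORAK", "Dvořák, Antonín"))
```

Normalizing both the query and the stored value with the same function is what makes case, punctuation and diacritics irrelevant to the comparison.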

Advanced Tab. The advanced tab offers the same fields as the basic tab, with the 
following additions. 

— Recording/Score Title (i.e., container title) 

— Other Contributor (e.g., arranger, producer) 

— Publisher 

— Subject Heading 

Keyword Tab. The keyword tab offers two fields. 

— Keyword(s) - accepts parentheses and the Boolean operators and, or, and not

— Media format (drop-down list with various types of recording and score formats) 
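
How such a keyword expression might be evaluated against one record is sketched below. This is an assumption-laden illustration rather than the Variations2 parser: we assume the usual precedence (not binds tighter than and, which binds tighter than or), exact keyword matching, and no implicit operator between adjacent keywords.

```python
import re


def evaluate(query, words):
    """Evaluate a Boolean keyword query against a set of lower-case words."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query.lower())
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        value = parse_and()
        while peek() == "or":
            take()
            rhs = parse_and()  # always parse the right side before combining
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "and":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "not":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        tok = take()
        if tok == "(":
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return value
        if tok in (")", "and", "or"):
            raise ValueError("unexpected token: " + tok)
        return tok in words

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result


record_words = {"bach", "fugue", "organ"}
print(evaluate("bach and (fugue or cantata)", record_words))
print(evaluate("bach and not organ", record_words))
```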

Browse Tab. The browse tab (Figure 2) offers browsing of the entire collection. Users
select one of the following “browse by” options.

— Creators (composers, poets, lyricists, etc.)

— Works

— Performers

— Recording albums/score volumes

Fig. 2. Search Window, Browse Tab

Users can initiate a search either by pressing the Enter key on their keyboards while 
they are in one of the text fields, or by clicking on the Search button. 

3.2 Results Display and Interaction 

The results display area uses a Java Swing component to render HTML. Descriptive 
text is black, hyperlinks are blue, and there are also buttons of various colors. Figure 3 
shows a part of the Figure 1 results display. In the gray box at the top of each result 
set is a description of the results that follow. The main entry (first line) for each result 
is in a larger font, and the matching part of the string (if any) is bolded. The “i” icon
button indicates detailed information is available.



4 creators matching bach were found.

Select All

1. Name: Bach, Carl Philipp Emanuel 1714-1788
   Role: Composer

Fig. 3. Search Results Detail
Results are not paged: all results are returned. If a search returns no results, the
results pane indicates which criteria matched something in the database
so users can broaden their search appropriately. Sample “zero results” output is
given in Figure 4.



NOTHING MATCHED ALL OF YOUR SEARCH CRITERIA. Try changing 
or removing search elements. 

- 1 Creator matches "beethoven". 

- NO Performers match "mazel". 

- 64 Work Titles match "symphony". 



Fig. 4. “Zero Results” Feedback 
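
Feedback of this kind can be produced by counting matches for each criterion separately. The sketch below mirrors the wording of Fig. 4; the substring matching rule, the function names and the sample index are assumptions made for illustration.

```python
HEADER = (
    "NOTHING MATCHED ALL OF YOUR SEARCH CRITERIA. "
    "Try changing or removing search elements."
)


def feedback_line(singular, plural, term, count):
    """Format one per-criterion line in the style of Fig. 4."""
    if count == 1:
        return f'- 1 {singular} matches "{term}".'
    shown = "NO" if count == 0 else str(count)
    return f'- {shown} {plural} match "{term}".'


def zero_results_feedback(criteria, index):
    """criteria: (singular label, plural label, term) triples;
    index: plural label -> list of stored values for that record type."""
    lines = [HEADER]
    for singular, plural, term in criteria:
        count = sum(term.casefold() in value.casefold() for value in index.get(plural, ()))
        lines.append(feedback_line(singular, plural, term, count))
    return lines


# Invented sample data, for illustration only.
index = {
    "Creators": ["Beethoven, Ludwig van"],
    "Performers": ["Maazel, Lorin"],
    "Work Titles": ["Symphony no. 5", "Symphony no. 9"],
}
criteria = [
    ("Creator", "Creators", "beethoven"),
    ("Performer", "Performers", "mazel"),
    ("Work Title", "Work Titles", "symphony"),
]
print("\n".join(zero_results_feedback(criteria, index)))
```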
The results pane control buttons work as follows. 

• The forward and back buttons work like the buttons in a web browser: back dis- 
plays the previously displayed search results, changing the tabs and search criteria 
at the top of the window as appropriate. Forward moves in the opposite direction 
through the results stack. Cancel stops a search in progress. 

• Sort By allows users to change the ordering of the displayed search results. The 
choices depend on the record type currently displayed. For example, the list of 
creators in Figure 1 may be sorted alphabetically by name or role. The list of con- 
tainers in Figure 2 may be sorted by title or (first listed) composer. 

• Show allows filtering of containers by media type (score or recording). 

Whenever a View or Listen button is present in the search results, clicking that but- 
ton, or clicking the title on that same line, will launch the Variations2 score viewer or 
audio player, as appropriate. 

An alternative navigation mechanism is available from a right-click popup menu 
(Figure 5). In this example, right-clicking on the score name offers two choices: open- 
ing the score in the score viewer (the default behavior had the link been clicked) or 
viewing detailed information about the score. 



Fig. 5. Right-Click Menu Navigation (popup menu on a score title, offering “View” and “Show detailed information for this recording/score”)



Right-clicking on “Vivaldi, Antonio” in Figure 5 also gives two options: getting de- 
tailed information about Vivaldi, or launching a new search for works by Vivaldi. 



Viewing Record Details. Users may request record details either by using the popup
menu or by clicking on the “i” button. Record details are displayed in a separate
window, also using HTML and having both internal and external links. Internal links
allow convenient navigation within a container record. External links provide record 
details for referenced records, bring up an audio player or score viewer, or provide 
links to external resources. 
Disambiguation. The search logic in Variations2 provides step-by-step
disambiguation during searches. Disambiguation steps are inserted in the search process when all
of the search criteria can be satisfied by a variety of results, but

— a name used as a search criterion matches more than one individual or collective name
in the database, or

— a work title used as a search criterion matches more than one work title in the
database.

There is a set sequence to the disambiguation. In the “worst” case, a user specifies an 
ambiguous creator, performer, work title, and other contributor. First the user is pre- 
sented with search results listing all the matching creators where the other criteria also 
have matches. After selecting the desired creator, the user is presented with the list of 
all matching performers who perform works by the selected creator, the other criteria 
still matching, etc. In this worst-case scenario, the user is not presented with media 
links until the fifth set of search results. Typically, however, only one or two disam- 
biguation steps are required. If, at any disambiguation step, users want to see all the 
results without having to disambiguate, they can click the “Select All” link (Figure 3). 
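
The set sequence can be sketched as a loop over the fields in a fixed order. This is an illustration only, not the Variations2 search logic: records are flat dictionaries, matching is by case-insensitive substring, and choosing “Select All” (modeled here as the callback returning None) is assumed to skip all remaining steps. The names and sample records are hypothetical.

```python
FIELD_ORDER = ("creator", "performer", "work_title", "other_contributor")


def disambiguate(records, criteria, choose):
    """Narrow the records one ambiguous field at a time.

    criteria maps field names to search terms; choose(field, options)
    returns the selected value, or None for "Select All".
    Returns (remaining records, number of disambiguation steps shown).
    """
    def matches(record, field, term):
        return term.casefold() in record[field].casefold()

    # Only records satisfying every criterion are ever offered.
    candidates = [
        r for r in records if all(matches(r, f, t) for f, t in criteria.items())
    ]
    steps = 0
    for field in FIELD_ORDER:
        if field not in criteria:
            continue
        options = sorted({r[field] for r in candidates})
        if len(options) > 1:
            steps += 1
            picked = choose(field, options)
            if picked is None:
                break
            candidates = [r for r in candidates if r[field] == picked]
    return candidates, steps


records = [
    {"creator": "Bach, Johann Sebastian", "performer": "Performer A", "work_title": "Sonata in G"},
    {"creator": "Bach, Carl Philipp Emanuel", "performer": "Performer A", "work_title": "Sonata in A"},
    {"creator": "Bach, Johann Sebastian", "performer": "Performer B", "work_title": "Sonata in C"},
]
criteria = {"creator": "bach", "work_title": "sonata"}
narrowed, steps = disambiguate(records, criteria, lambda field, options: options[0])
print(steps, [r["work_title"] for r in narrowed])
```

In this example the creator is ambiguous, so one step is shown; once a creator is picked only one work title remains, so no second step is needed.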



3.3 Current State 

Cataloged content in Variations2 is somewhat limited at present. In March 2004, the 
digital library contained records for 1500+ works and 1300+ contributors, in support of
282 containers (262 recordings and 20 scores). The collection grows in response to 
pilot project needs, development team testing needs, and an overall goal of broaden- 
ing the collection. 

Variations2 is installed on approximately 120 computers in the music library. Any 
person with an IU login can come to the music library and use Variations2. While the 
primary mechanism for online access to music at IU is still IUCAT and Variations 
(our previous-generation digital music library [16]), Variations2 is available for gen- 
eral use. 



4 Conclusions and Future Work 

This paper documents the current Variations2 digital music library search user
interface as a user-centered, FRBR-like alternative to traditional MARC-based OPACs as
mechanisms for finding music. We have carried out multiple lab-based and field- 
based evaluations; results will be published separately. The short summary of our 
evaluation results is that we found no fundamental flaws with the user interface or the 
design of the metadata model. Such problems as were uncovered seem addressable by 
relatively non-invasive user interface improvements. 

Variations2 is a continuing research project. Among the search-related features 
planned for future releases are the inclusion of themes and incipits in the search inter- 
face, initially as a means for users to distinguish works but eventually as a mechanism 
for limited content-based searching of music. We also plan on adding search fields for 
instrumentation, genre (e.g., jazz, pop, rock), musical form (e.g., song, symphony, 
opera), and style (e.g., baroque, romantic). To the current audio recording and 
scanned score formats we plan to add encoded scores. We are also considering
implementing a web-browser-based search interface.

Variations2 is designed as a distributed solution, for use by multiple institutions. 
The current implementation is more monolithic, based on the collection of a single 
institution. As we evolve Variations2 to fulfill its distributed promise, we will have to 
consider how a distributed “union” catalog can be used within the search interface 
(while ensuring only authorized access to the digital content!). Only by addressing 
barriers to distributed deployment can we develop the cooperative cataloging com- 
munity necessary to support re-cataloging and thereby a future existence for our 
metadata model and software. 

Abstract. In this paper, we explore approaches to multi-lingual information 
retrieval for Greek, Latin, and Old Norse texts. We also describe an information 
retrieval tool that allows users to formulate Greek, Latin, or Old Norse queries 
in English and display the results in an innovative clustering and visualization 
facility. 



1 Introduction 

Cross-lingual information retrieval is a particularly intriguing technology for students 
and scholars of Ancient and Early-Modern Greek and Latin or Old Norse. Works 
written in these languages are extremely important for understanding our literary, 
scientific, and intellectual heritage, but these languages are difficult and few people 
know them well. In particular, this technology can be extremely useful for non- 
specialist scholars and students who are somewhat familiar with these languages, but 
who do not know enough to form a mono-lingual query for a search engine. Students 
of Ancient Greek literature, for example, might want to know more about the quality 
of ‘cunning intelligence’ that is admired and exemplified in the character of Odysseus 
in Homer’s Odyssey. Because this quality is multifaceted, it would be very difficult 
for readers to formulate a query for this type of passage if they were working only 
with an English translation of the text; they must rely on the consistency of the trans- 
lator. A cross-lingual information system, on the other hand, would help students 
identify words or key phrases - such as the Greek word for cunning intelligence, 
‘metis’ - and then study passages where they appear. 

Such a system is, of course, only the beginning. At best, it can identify passages 
that need further study and translation since a user who cannot formulate a query 
probably cannot easily read the text in its original language either. While a great deal 
of work has been done on these sorts of systems in venues such as the Cross Lingual 
Evaluation Forum (CLEF) and the Translingual Information and Detection program 
(TIDES), their focus has largely been on business journals, newswires, and national 
security applications. Our work has focused on evaluating how the needs of students 
and scholars in the humanities differ from those in other domains and developing a 
system to meet these needs. 

2 Context and Testbeds 

The work described in this paper takes place in the context of the Cultural Heritage 
Language Technologies consortium (http://www.chlt.org), a jointly funded project of 
the National Science Foundation and European Commission Information Society 
Technologies Program. This project is a collaborative effort of eight partner institu- 
tions located in both the United States and Europe. Many of these partners have con- 
tributed corpora and core technologies that we have relied on in our work. Our test- 
beds for this project include the six million words of Greek and four million words of 
Latin with parallel translation from the Perseus Digital Library 
(http://www.perseus.tufts.edu); more than one million words of Latin drawn from early printed works in 
the history of science from the Special Collections department at the Linda Hall Library 
in Kansas City (http://www.lindahall.org); a 750,000 word corpus of Early-Modern 
Latin from the Stoa consortium at the University of Kentucky (http://www.stoa.org); 
a corpus of Isaac Newton’s alchemical, theological, and chemical papers from the 
Newton Project at Imperial College (http://www.newtonproject.ic.ac.uk/); and a cor- 
pus of Old Norse sagas from the University of California at Los Angeles. In addition 
to these textual testbeds, the Perseus Project has also provided its parsers and ma- 
chine-readable dictionaries for Greek and Latin while the group at UCLA is creating 
comparable resources under the aegis of this project. 



3 Approaches to the Problem 

The problem of multi-lingual information retrieval is essentially one of machine 
translation on a very small scale. There have been two dominant approaches to this 
problem: 1) dictionary translation using machine-readable multi-lingual dictionaries 
and 2) automatic extraction of possible translation equivalents by statistical analysis 
of parallel or comparable corpora 1 . 

Dictionary translation is a low-cost search technology that translates queries by 
substituting each word in a query with translations automatically derived from the 
machine-readable dictionary. This approach by itself is not very good, achieving 
results that are only 40-60% as effective as a mono-lingual search ([4-6]). The pri- 
mary problems of this approach are related to the introduction of extraneous words 
and ambiguity into the query due to the multiple senses contained in most dictionary 
entries, the failure of most machine-readable dictionaries to account for technical 
terms in a consistent way, and the loss of important fixed phrases. 
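
The word-by-word substitution described above can be sketched as follows. This is a minimal illustration, not the system's implementation: the toy lexicon, its glosses (taken from the examples in this paper), and the function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy Greek-English lexicon: headword -> English glosses.
TOY_LEXICON = {
    "muthos": ["speech", "story", "tale"],
    "ainos": ["tale", "story"],
    "epos": ["speech", "tale"],
}

def invert_lexicon(lexicon):
    """Build an index from each English gloss to the headwords it defines."""
    index = defaultdict(set)
    for headword, glosses in lexicon.items():
        for gloss in glosses:
            index[gloss.lower()].add(headword)
    return index

def translate_query(query, index):
    """Substitute each English query word with every headword glossed by it."""
    return {word: sorted(index.get(word.lower(), ())) for word in query.split()}

index = invert_lexicon(TOY_LEXICON)
translation = translate_query("story speech", index)
for english, candidates in translation.items():
    print(english, "->", candidates)
```

Even in this toy case each English word maps to several candidates, which is the source of the extraneous terms and ambiguity noted above.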

Automatic extraction of translation equivalents from parallel or comparable cor- 
pora introduces similar sorts of ambiguity and carries two additional problems: 1) 
these corpora can be extremely expensive to produce, and 2) these automatically 
extracted translation equivalents are most effective in restricted domains ([7-9]). 



1 There are, of course, other approaches. [1] points out that it is also theoretically possible to 
machine-translate target documents, but this technology is not yet feasible for most modern 
languages, let alone Greek, Latin, or Old Norse. See also [2] and [3] for an innovative ap- 
proach based on topic modeling. 

The needs and nature of our user community of students and scholars in a humanities 
digital library suggest that we can profitably adopt both of these approaches if we 
take appropriate steps to reduce query ambiguity. The nature of the corpus of Ancient 
Greek and Latin and Old Norse texts makes it ideal for this project, as it is highly 
domain specific within some broad parameters 2 . Further, the corpus itself is very 
stable, so the cost of creating a parallel corpus is finite and the investment, once 
made, would have lasting value for students and scholars in its field. At the same 
time, these ancient languages have been highly studied and thus can benefit from the 
work of scholars who have developed comprehensive ‘unabridged’ lexica as well as 
domain specific dictionaries for both fields of discourse and specific authors. 

The information-seeking behaviors of the people who use digital resources in these 
languages also inform our approach. Students and scholars of ancient languages are 
almost a ‘hyper-fit’ for the profile of a user of a multi-lingual information retrieval 
facility. Very few specialists are trained to write and speak Greek, Latin, or Old 
Norse; advanced training - for the most part - focuses on reading these languages. 
This focus on reading, however, means that the user community is trained in a philol- 
ogical approach that focuses on the use of small families of words and that is attuned 
to the shades of overlapping meanings of different words. The example in the intro- 
duction of a scholar studying ‘cunning intelligence’ is not random but drawn from a 
book-length study of the word metis ([11]). Further, even the most skilled readers of 
ancient languages are well versed in the use of reference works such as grammars and 
dictionaries and are accustomed to using them regularly as they read. Classicist Mar- 
tin Mueller describes the user community as follows: “Very few readers know an- 
cient Greek well enough to read it without frequent recourse to a dictionary or gram- 
mar, and because of their highly specialized interests, the few readers who can do so 
are likely to be particularly intensive users of such reference works” ([12]). 

The nature of our users means that they are well equipped to help translate their 
query into the target language as long as they are provided with tools to help them in 
this process. In 1972, Salton demonstrated that with carefully constructed query ex- 
pansion thesauri, multi-lingual information retrieval tools could be as effective as 
mono-lingual tools ([13]). The information retrieval community has, however, 
eschewed Salton’s arguments for hand-constructed query expansion thesauri in favor of 
solutions that are more general and domain independent (e.g., [5], [8]). Salton’s 
carefully constructed thesauri are still expensive, but this is an expense that can reasonably 
be shifted to each end user at query time for humanities applications. A tool that helps 
them give feedback during the query translation process allows users to construct 
their own ad hoc query expansion thesauri, thus facilitating the construction of a 
query that is most useful for their needs. This approach does not preclude automatic 
disambiguation methods; as we will demonstrate below, we have developed a user 
feedback mechanism with tools to help end-users translate queries including easy 
access to machine readable dictionaries and several query-specific statistical meas- 
ures that assist users’ identification of relevant search terms. 



2 In fact, the Thesaurus Linguae Graecae already defines 86 restricted domains for the 
surviving corpus of more than 71 million words written in Ancient Greek (see [10] and 
http://www.tlg.uci.edu). 

4 Query Formation 

4.1 Query Translation 

The search facility begins with a simple interface that allows users to enter search 
terms in English, to select the sources that will be used for query translation, and to 
restrict their results to words that appear in works written by a particular author. 

Fig. 1. Query Entry Screen 




Fig. 2. Query Translation Screen 

Several of the options presented to the user in this phase are integrated with the 
larger digital library system and designed to scale up as new texts and reference 
works are added. The system for dictionary translation is based on a piece of middle- 
ware with a modular design that automatically extracts translation equivalents from 
any SGML or XML dictionary tagged in accordance with the guidelines of the Text 
Encoding Initiative or any other user-defined DTD. The author list restrictions are 
generated from the cataloging metadata from the digital library. 
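
As an illustration of how translation equivalents might be pulled from a tagged dictionary, the sketch below parses a small XML fragment with Python's standard library. The sample markup, the element names (`entry`, `orth`, `tr`), and the function name are assumptions chosen to resemble a TEI-style dictionary; they are not taken from the middleware described here. The tag names are parameters, which loosely mirrors the idea of supporting a user-defined DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical dictionary fragment; real dictionaries will differ in structure.
SAMPLE = """
<div>
  <entry key="muthos">
    <form><orth>muthos</orth></form>
    <sense><tr>speech</tr>, <tr>story</tr> or <tr>tale</tr></sense>
  </entry>
  <entry key="epos">
    <form><orth>epos</orth></form>
    <sense>that which is uttered in words, <tr>speech</tr>, <tr>tale</tr></sense>
  </entry>
</div>
"""

def extract_equivalents(xml_text, entry_tag="entry", head_tag="orth", equiv_tag="tr"):
    """Map each headword to the list of translation equivalents in its entry."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for entry in root.iter(entry_tag):
        head = entry.find(f".//{head_tag}")
        if head is None or not head.text:
            continue  # skip entries without a usable headword
        result[head.text.strip()] = [
            t.text.strip() for t in entry.iter(equiv_tag) if t.text
        ]
    return result

equivalents = extract_equivalents(SAMPLE)
print(equivalents)
```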

After entering query terms, users are presented with an interface with detailed in- 
formation to allow them to construct the best translation of the word for their needs. 
This process can range from the simple elimination of obvious ambiguities and mis- 
takes to a careful consideration of every term. The interface provides a list of transla- 
tion equivalents for the word or words that the user entered along with an automati- 
cally abridged English definition of the word, a link to the full definition for each 
word, a list of authors who use the words, and data about the frequency of each word 
in works by the selected authors. 



4.2 Query Expansion 



One of the challenges of this sort of multi-lingual information retrieval system is the 
dependence on a match between the concept that the user wants to study and the 
translation equivalents provided in the dictionary entry for the word. For example, a 
user interested in searching for Greek words that might mean ‘story’ will find several 
very good translation equivalents, including the Greek word muthos that means 
“speech, story or tale” and is cognate with the English word ‘myth,’ as well as other 
words such as ainos, meaning “tale or story,” and polumuthos, a compound word 
meaning “much talked of, famous in story”. The first phase will, however, miss other 
related words that do not happen to have the word ‘story’ as part of their definition, 
such as epos, defined as “that which is uttered in words, speech, tale.” 

To address this problem, we provide users with a query expansion option that sug- 
gests other words that are related to the exact matches returned by their initial query. 
These related terms are generated by an analysis of the definitions contained in the 
electronic machine-readable multi-lingual dictionaries. This process involves extract- 
ing all of the translation equivalents from the dictionaries and stripping suffixes from 
the translation equivalents using Porter’s algorithm. We exclude translation equivalents 
where $df_i / N > 0.3$, with N equal to the number of definitions in the dictionary. The 
terms themselves are assigned a binary weight rather than a weight such as tf x idf. 
Our experiments with various weighting schemes revealed that they had very little 
impact on the results because documents were very short (just over four words on 
average). Having developed this index, we determine the entries that are most similar 
to each other using a simple Dice similarity coefficient 

$$sim(def_i, def_j) = \frac{2\,|def_i \cap def_j|}{|def_i| + |def_j|}$$

The five words with the highest correlation coefficient are then included in the 
results for the query translation phase of the process. 
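
The expansion step can be sketched as follows, assuming each definition has already been reduced to a set of stemmed terms with binary weights. The toy definitions, the function names, and the tie-breaking rule are assumptions for illustration. The frequency filter is shown as a separate function because the 0.3 threshold only makes sense on a dictionary-sized collection.

```python
def filter_common_terms(definitions, max_ratio=0.3):
    """Drop terms that occur in more than max_ratio of all definitions."""
    n = len(definitions)
    df = {}
    for terms in definitions.values():
        for term in terms:
            df[term] = df.get(term, 0) + 1
    return {
        head: {t for t in terms if df[t] / n <= max_ratio}
        for head, terms in definitions.items()
    }

def dice(a, b):
    """Dice coefficient of two term sets: 2|A n B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def related_entries(headword, definitions, top_n=5):
    """Return up to top_n other headwords ranked by Dice similarity."""
    target = definitions[headword]
    scored = [
        (dice(target, terms), other)
        for other, terms in definitions.items()
        if other != headword
    ]
    scored = [(score, other) for score, other in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken alphabetically
    return [other for _, other in scored[:top_n]]

# Hypothetical definitions, already tokenized (stemming omitted for readability).
DEFS = {
    "muthos": {"speech", "story", "tale"},
    "ainos": {"tale", "story"},
    "epos": {"utter", "word", "speech", "tale"},
    "polumuthos": {"much", "talk", "famous", "story"},
    "hippos": {"horse"},
}
related = related_entries("muthos", DEFS)
print(related)
```

In this toy data, epos is suggested for muthos even though its definition never contains the word ‘story’, which is the gap the expansion option is meant to close.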

In many cases - as in the above example of a search for the word ‘story’ - this 
process enhances what are already very good search results. By its nature, this proc- 
ess expands recall at the expense of precision, thus running the risk of presenting the 
user with too much irrelevant information in the query translation phases. Therefore, 
a user seeking a more precise query can switch off the query expansion function. 



4.3 Sources of Translation Equivalents 



Our current research is focused on determining whether the work of Church and Gale 
for the Oxford English Dictionary [14] can be applied to our parallel corpora of 
Greek texts with English translations and Latin texts with English translations. 
Church and Gale argue that a χ² test can be used to determine translation equivalents 
in parallel corpora aligned at the sentence level. They posit a null hypothesis that 
words occur in parallel sentences independently or by chance. This null hypothesis is 
then compared with the actual count of term co-occurrence across parallel corpus 
blocks using the following equation: 

$$\chi^2 = \frac{(O - E)^2}{E}$$

with O equal to the number of times that a word pair appears 
together and E equal to the average number of times that the terms would appear 
together if they were evenly distributed across the entire corpus. Our hope is that we 
will be able to generate a dynamic thesaurus of translation equivalents based on our 
corpora and offer this thesaurus to our users alongside the machine-readable diction- 
aries that we are currently using in this interface. 
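
A minimal sketch of this scoring over aligned blocks follows. It assumes that E is estimated as the product of the two words' block frequencies divided by the number of blocks, and it keeps only pairs that co-occur more often than expected; both choices, the toy data, and the function name are assumptions for illustration rather than a description of Church and Gale's procedure.

```python
from collections import Counter
from itertools import product

def chi_square_pairs(aligned_blocks):
    """Score source/target word pairs over a list of (source, target) blocks."""
    n = len(aligned_blocks)
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in aligned_blocks:
        src_set, tgt_set = set(src), set(tgt)
        src_count.update(src_set)  # number of blocks containing each source word
        tgt_count.update(tgt_set)  # number of blocks containing each target word
        pair_count.update(product(src_set, tgt_set))
    scores = {}
    for (s, t), observed in pair_count.items():
        expected = src_count[s] * tgt_count[t] / n  # assumed independence estimate
        if observed > expected:  # ignore pairs that co-occur no more than chance
            scores[(s, t)] = (observed - expected) ** 2 / expected
    return scores

# Hypothetical aligned blocks (transliterated Greek, English).
BLOCKS = [
    (["muthos", "legei"], ["he", "tells", "story"]),
    (["muthos", "kalos"], ["fine", "story"]),
    (["hippos", "kalos"], ["fine", "horse"]),
    (["hippos", "legei"], ["he", "tells", "horse"]),
]
scores = chi_square_pairs(BLOCKS)
print(sorted(scores))
```

With blocks of several hundred words, many more unrelated pairs co-occur by chance, which is why alignment granularity matters for this method.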

Church and Gale’s results are intriguing, but we need to determine if they can be 
applied to texts written in Greek and Latin. We are focusing our investigations in 
three key areas. 

First, Church and Gale worked on business documents written in English and 
French drawn from the Union Bank of Switzerland corpus. Greek and Latin have 
much more complex morphological structures and very free word order, so it is nec- 
essary to study the impact of these linguistic differences when applying this algo- 
rithm. 

Second, our corpora are aligned with a much lower level of granularity than the 
corpus tested by Church and Gale. Scholars traditionally refer to classical texts using 
a standard system, such as line number for poetry or page/paragraph numbers of an 
early printed edition for prose. For example, the works of Plato are referenced by a 
pagination system from a three-volume collection of Plato’s works published in 1578 
by Henri Estienne. The three volumes were numbered consecutively and each page 
was divided into sections with the division marked by the letters a-e. Plato’s dia- 
logues are cited using the name of the dialogue, the page number from this edition, 
and the letter from the section containing the beginning of the citation. Other prose 
works are divided in similar ways based on other early printed antecedents. Our par- 
allel corpora of prose are aligned at this level and the resulting blocks can range from 



174 Jeffrey A. Rydberg-Cox et al. 



a few hundred words to almost one thousand words. Poetry is even more complicated 
because line numbers offer a false sense of precision. In actuality, the number of lines 
in a translation can vary widely between the original and the translation and - even 
when this is accounted for - word order conventions are so different that words could 
appear on widely different lines. We have obtained good preliminary results by work- 
ing with aligned segments of ten lines, but we need to determine if this lower level of 
granularity will work generally across our corpora or - alternately - if we need to 
explore methods for working with comparable corpora rather than parallel corpora. 

Finally, this approach is similar to our query expansion routine in that it favors re- 
call over precision. We will need a detailed study of our results to determine whether 
or not the information we are adding is useful to users translating their queries. 



5 Visualizing Results 

After users translate their queries with these tools, the search is passed to a monolin- 
gual search engine with several visualization front ends (described in more detail in 
[15, 16]). These front ends are alternatives to the traditional ranked list view of search 
results and are based on the on-the-fly calculation of keywords for the documents 
returned by the query. Keywords are calculated using the equation: 

$$W_j = \frac{1}{d_j} \times r \log\left(\frac{|R|}{r}\right)$$

where |R| is the total number of documents returned by the query, r is the number of 
documents in the returned set containing term j, and $d_j$ is the number of documents in 
the entire collection containing term j. This factor is used in favor of tf x idf ranking 
because it favors salient words within the returned document set that are also dis- 
criminative. By calculating these scores at query time based on the query and the 
returned document set, we are able to improve our results as compared to a weight 
calculated for each term in the collection calculated in the indexing phase. 
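
The query-time calculation can be sketched as follows, reading the keyword weight as (r / d_j) log(|R| / r) with the symbols defined above. That reading, the toy documents, the collection counts, and the function name are assumptions for illustration.

```python
import math

def keyword_weights(returned_docs, collection_df):
    """Weight each term in the returned set: (r / d_j) * log(|R| / r)."""
    size = len(returned_docs)  # |R|
    r = {}
    for doc in returned_docs:
        for term in doc:
            r[term] = r.get(term, 0) + 1
    return {
        term: (count / collection_df[term]) * math.log(size / count)
        for term, count in r.items()
    }

# Hypothetical returned documents (as term sets) and collection-wide counts d_j.
RETURNED = [
    {"metis", "odysseus"},
    {"metis", "athena"},
    {"odysseus", "ship"},
    {"odysseus"},
]
COLLECTION_DF = {"metis": 4, "odysseus": 40, "athena": 10, "ship": 20}

weights = keyword_weights(RETURNED, COLLECTION_DF)
best = max(weights, key=weights.get)
print(best, round(weights[best], 4))
```

Under this reading, a term that is concentrated in the returned set scores highly, while a term that occurs in every returned document receives a weight of zero because it cannot separate clusters.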

These interfaces visually group documents that our calculations have determined 
to be related, and label each group with the most appropriate keyword. They also 
offer users the opportunity to revisit some of the translation decisions that they made 
in the previous step, allowing them to eliminate certain keywords from the search 
results. A user may browse related documents or, alternately, refine searches by drill- 
ing down to sub-clusters. Our hope is that by placing related Greek or Latin passages 
in meaningful conceptual groups we will reduce the time the user spends sorting 
through a ranked list of search results. 

The first visualization interface is a tree view that represents documents as the 
nodes of a binary tree flattened into a circular pattern. Due to constraints on size of 
display, the tree is only displayed at five levels, with the bottom level representing 
further sub-clusters where appropriate. The terminal nodes are distinguished by color 
cues, with red nodes representing documents and yellow nodes as further sub- 
clusters. Each node is also labeled with the highest-frequency keyword associated 
with that cluster. 




Fig. 3. Tree Visualization of Search Results 



As the user mouses over the nodes, the selected nodes are highlighted, and the user is 
presented with a menu showing the number of documents and all of the keywords 
associated with that cluster. This menu also allows the user to drill down on any node 
and re-center the tree around the selected node. Further, within this visualization, the 
user is able to eliminate keywords from the search results, view fragments of every 
document in the collection, and follow a link to the complete document within the 
digital library. 

Fig. 4. Sammon Visualization of Search Results 

The second visualization generates a Sammon map that provides users with a visual 
landscape for navigation. In this interface, each cluster is represented as a circle 
and is labeled with its highest frequency keyword. The radius of the circle indicates 
the relative size of each of the clusters, while the distance between the circles repre- 
sents the relative similarity of the different clusters. As in the tree visualization, 
mousing over a cluster provides a menu containing the size of the cluster along with 
its associated keywords and offering the user an opportunity to re-center the display 
around the selected cluster. 

The third display offers a radial visualization in which the twelve highest ranked 
keywords in the returned search results are displayed in a circle. Each document in 
the returned set is represented as a point in the middle of the circle with its placement 
determined by the relative pull of each of the keywords distributed around the circle. 
Users can determine the keywords contained in each document by mousing over each 
point. As in the two previous interfaces, this visualization allows users to eliminate 
keywords and follow links to a full text display in the digital library. 
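
One common way to realize this kind of placement is to put each document at the weighted average of the positions of the keywords it contains. The sketch below assumes that scheme; the paper does not specify the exact calculation, and the function names and weights are hypothetical.

```python
import math

def keyword_anchors(keywords):
    """Place keywords evenly around a unit circle."""
    n = len(keywords)
    return {
        kw: (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i, kw in enumerate(keywords)
    }

def place_document(doc_weights, anchors):
    """Position a document at the weighted mean of its keywords' anchors."""
    pulls = [(w, anchors[kw]) for kw, w in doc_weights.items() if kw in anchors]
    total = sum(w for w, _ in pulls)
    if total == 0:
        return (0.0, 0.0)  # no displayed keyword: leave the point at the center
    x = sum(w * pos[0] for w, pos in pulls) / total
    y = sum(w * pos[1] for w, pos in pulls) / total
    return (x, y)

anchors = keyword_anchors(["a", "b", "c", "d"])
only_a = place_document({"a": 1.0}, anchors)
between = place_document({"a": 1.0, "c": 1.0}, anchors)
print(only_a, between)
```

Under this assumption, moving a keyword to a new position on the circle changes its anchor and therefore shifts every document that contains it.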




Fig. 5. Radial Visualization of Search Results 

Further, this third interface allows users to adjust the clustering to suit their informa- 
tion needs. If they are interested in documents that contain keywords that are distrib- 
uted widely around the radial display, the interface permits them to select keyword 
nodes and move them around the circle. This action shifts the position of related 
documents within the circle and brings together documents that are most useful for 
the end user. 

Finally, although we hope the visual process will be more useful for our end users, 
we also are aware that people are not accustomed to these types of interfaces. There- 
fore, a traditional list with search results grouped together and ranked using the tradi- 
tional tf x idf score is available as well. 

Fig. 6. Radial Visualization of Search Results with Dynamic Re-Clustering 



6 Evaluation and Future Research 

With these interfaces, we provide our users with a great deal of information that they 
can use to translate queries in a way that is most appropriate for their information- 
seeking interests. At the same time, we provide them with three innovative interfaces 
within which they can browse the resulting data. In addition to our work on automati- 
cally generated translation thesauri for Greek and Latin, our next phases will focus on 
user evaluation. 

We have already done testing on the quality of the clusters and received user feed- 
back on the visualization interfaces in English. We now need more controlled user 
studies of the clustering interface for Greek, Latin and Old Norse. The largest obsta- 
cle in this area is the lack of a standard set of documents, queries, and relevance 
judgments for the corpus of texts written in these languages that would allow us to 
generate standard precision and recall metrics for our work. As digital libraries ex- 
pand from modern European languages to cultural heritage materials, the need for 
these sorts of evaluation corpora will become more urgent if we are going to be able 
to effectively evaluate these sorts of tools. Groups such as the Cross-Lingual Evalua- 
tion Forum (CLEF) and the Document Understanding Conference (DUC) provide a 
model; building a consortium to follow their lead in creating an evaluation corpus for 
cultural heritage materials must be one of the next priorities for our project. 

Abstract. The paper will present the work of the Forum on Information 
Standards in Heritage (FISH) - www.fish-forum.info - in the development of stan- 
dards and protocols to support interoperability between historic environment 
sector information systems. The paper describes barriers to interoperability 
within the sector. These originate in the unique character of the historic envi- 
ronment as an information source. Progress in the development of relevant 
standards is reviewed and emphasis placed upon community building to support 
standardisation. Current work to develop an XML-based interoperability 
‘toolkit’ of schema and protocols to support knowledge-sharing networks is 
described. This will be based on current FISH standards along with the CIDOC 
Conceptual Reference Model, an emerging ISO standard ontology for cultural 
heritage information. 



1 The Historic Environment Information Landscape 

The ‘historic environment’ is all around us. It consists of the totality of those aspects 
of the built heritage, archaeology and current and past landscapes that together form 
both the subject of study for academics, and a perceived ‘sense of place’ for those that 
live and work within such an environment. The holistic approach implicit in the 
phrase ‘historic environment’ presents particular challenges to the designers and man- 
agers of historic environment information resources (HEIRs). It is useful to introduce 
these challenges to provide a background to this presentation of the data standards 
that have emerged within the sector, the means by which they are developed, and the 
current work on interoperability to secure the benefits of that standardisation work. 



1.1 Multiplicity of Interests 

The historic environment is not ‘owned’ or curated by any one single organisation, 
and there are often many organisations interested in the same site. A contrast can be 
drawn between, for example, a museum object which will generally be documented 
by a single curating authority, and a Bronze Age burial mound. The latter may be 
recorded simultaneously by any or all of the following: a local authority for develop- 
ment control purposes, a national body for purposes of legal protection, the landowner 
for land management, a thematic national survey of sites of a particular type, a scien- 
tific or research group as the origin of a significant sample at the scale of microns and 
a landscape survey project working on a scale of kilometres. This contrast is reflected 
in the distinction between process-based standards current in the UK museums sector 
(SPECTRUM from mda) and object-based standards for the historic environment [1]. 



1.2 Separation of Data from Documented Object 

Features of the historic environment do not usually provide their own documentation. 
This is in contrast to an item of archive, which may well convey enough information, 
either within itself or by virtue of its context within a collection, for it to be ade- 
quately described. In consequence the recorded information about a Bronze Age bur- 
ial mound is arguably as significant as the original site. This issue becomes most 
acute in cases where the site no longer exists. In the jargon, ‘preservation by record’ 
means that recorded information may well stand as surrogate for the actual site itself. 

1.3 Unique Character 

No two features of the historic environment are identical, and it is this diversity which 
is often the subject of interest. Two Bronze Age burial mounds, for all their similari- 
ties, cannot be treated in the same way as two copies of the same book. Often there is 
uncertainty over the correct interpretation of such a feature. Is it really a burial 
mound? Is it really Bronze Age? Maximising future retrieval of records suggests the 
requirement to index all the possible alternative interpretations that the available evi- 
dence supports. Opportunities for rigorous rules-based classification of features of the 
historic environment are very limited, and have received little attention in comparison 
to, for example, the classification of archaeological artefacts. 



2 Consequences for the HEIR User 

Faced by these challenges, historic environment information resource managers have 
developed many different software platforms and database designs. Even within a 
single subset of the sector, the Historic Environment Records (formerly known as 
Sites and Monuments Records or SMRs) maintained by English local authorities, 
Newman has identified the need for extensive auditing to promote consistent quality 
[2]. The Historic Environment Information Resources Network has surveyed and 
reported on the variability and fragmentation of these diverse information systems [3]. 

To achieve a full picture of the existing knowledge of a site it is therefore neces- 
sary to draw upon information from a large number of different information systems, 
which will in most cases have different physical designs, and quite often different 
underlying logical models. They will support different types of search, and will not 
provide consistent output. The process of transmission of data between these incom- 
patible data structures is therefore complex and costly. Each attempt to transfer data 
between systems requires individual design, so that to provide data to multiple part- 
ners quickly becomes prohibitively expensive. The routines developed are vulnerable 
to changes in technology in either partner in the exchange (Fig. 1). 

Fig. 1. A complexity of different information management ‘tools’ is required for an HEIR to 
provide data to a variety of external partners. 



3 Data Standards for the Historic Environment 

The growth of personal computing technology from the late 1970s onwards had a 
dramatic effect on the production and management of information relating to the 
historic environment, as in all other ‘memory-based’ sectors. A conference devoted to 
Computer Applications and Quantitative Methods in Archaeology has run annually 
since 1978: its first proceedings cover 84 pages [4]. The Royal Commission on 
Historical Monuments of England (RCHME) computerised the English National Ar- 
chaeological Record, originally maintained by the Ordnance Survey, in 1985 [5]. 

The explosion of digital information was paralleled by development of standards 
and recommendations for the content of historic environment records. A data standard 
is taken to mean for this purpose an ‘agreed definition of what is to be recorded, and 
how, to achieve a particular objective’. In 1981 the Department of the Environment 
issued an Advisory Note setting out the recommended fields of information to be 
recorded in an SMR [6]. In 1993 the RCHME and English Heritage published ‘Re- 
cording England’s Past’ [7]. This recommended fields and specified terminology lists 
for the control of data entry to each field. A ‘Standard Data Format’ for the exchange 
of records of monuments, using ‘tagged data’ - text files using a form of mark-up 
language - was agreed by the RCHME and the representatives of the SMR commu- 
nity in 1994 [8]. 

These early standards were aimed at the professional sector. Increasing access to 
computing power and cheaper more user friendly database products stimulated the 
growth of the independent and voluntary sector inventories of the historic environment.
The availability of National Lottery funding from the mid-1990s promoted this
trend. The independent sector, exemplified by such groups as the Tiles and Architec- 
tural Ceramics Society, the British Sundial Society and The Letterbox Study Group, 
had a wider range of interest than the focused approach of the 1993 standard. They 
also needed more encouragement and assistance than a simple but inflexible standard 
could provide. In response, the RCHME, with partners from the English historic
environment sector, published a new national standard, ‘MIDAS: A Manual and Data
Standard for Monument Inventories’, in 1998 [9]. In addition to discussion within the
English historic environment sector, this drew in part on the international work under- 
taken by CIDOC, the documentation subgroup of ICOM, which had issued guidance 
on the recording of archaeological sites and historic buildings [10] during the 1990s. 

In parallel with standards for the content of information systems, terminology stan- 
dards such as the RCHME Thesaurus of Monument Types [11] have developed rap- 
idly, and multiplied. In 2000 a framework of terminology standards to complement 
MIDAS was established under the title INSCRIPTION [12].



4 Nurturing an ‘Information Ecology’

The phrase ‘information ecology’ is taken from the work by Nardi and O’Day [11].
Their study describes the need for an information system to be regarded as one part in 
an interdependent community that must support the needs not just of the system 
builders, but also include the end users. The term ecology is deliberately chosen by 
Nardi and O’Day to evoke the sense of dynamism and fragility inherent in a biological
system. The same approach, I believe, can usefully be applied to the development
of the standards that underpin information systems. This is particularly relevant in 
cases where interoperability is necessary, and a wide range of special interests need to 
be harmonised. It emphasises that information systems and data standards can only 
succeed where they also relate to the needs and experience of the creators and end- 
users of the information they relate to. 

Two issues serve to illustrate this point: intellectual property rights, and access to 
confidential or sensitive data such as the exact location of sites that have yielded 
valuable artefacts. No system or data standard to support interoperability of data will 
succeed without attention to issues such as these that might otherwise prevent or con- 
strain the movement of information between systems. 

In the historic environment sector such an ecology evolved from the early 1990s
work on ‘Recording England’s Past’. This established collaborative working between
staff from the two major organisations in the English heritage sector. The decision to 
rework this standard into MIDAS led to the creation of the Data Standards Working
Party, with involvement from a wider, but still Anglo-centric, group. With publication
of MIDAS complete, the group remained in existence to foster its development
and implementation. A new title, the Forum on Information Standards in Heritage, 
England (FISHEN) was adopted, and involvement of representatives from the Scot- 
tish, Welsh and Irish heritage organisations was sought. Eventually, the U.K. wide 
Forum on Information Standards in Heritage (FISH) was established (spawned?) in 
2000. These then are the organisations (or organisms?) in the ecology. However the 
dynamism, the flow of energy round the ecology as it were, stems from the pattern of 
consultation and collaborative working which has emerged. At the heart of FISH is an 
email discussion list [14] with some 330 members from across the U.K. and from 
around the world. This is used to air ideas, share information, and seek advice and 
comment. The starting point for work on a new area of standards development is often 
a structured discussion or ‘e-conference’ held on the FISH list [15]. When a new 
standard is developed to a draft stage, the list can be used to contact potential review- 
ers to ensure that the new standard is relevant to the needs of the whole ‘ecology’ 
[16]. A formalised peer review then either supports the approval of the new standard
by the steering committee of FISH or recommends further re-working. Additional
structure and robustness is given to the process by the adoption for FISH projects of a 
formal project management methodology, PRINCE2 [17], a system widely used in 
UK public sector information technology projects. 

Planned work for FISH includes the extension of the MIDAS standard to a broader 
range of historic environment resources, working towards publication of a second 
edition in 2005. The current focus of attention is, however, on the FISH Interoperabil- 
ity Toolkit. 



5 The FISH Interoperability Toolkit 

The work of FISH and its predecessors has done much to develop some commonality 
of content and terminology between HEIRs. However, the costs of export, manipula- 
tion and migration of data between systems are still prohibitive. To tackle this issue 
FISH has developed a vision of an interoperability ‘toolkit’, a range of protocols, 
formats, agreements and training materials necessary to provide HEIR developers and 
managers with the means to move data between systems. Subsidiary objectives in- 
clude the development of a format that will be suitable for the long-term storage of 
archived data from project databases, and to assist in the migration of data from old 
systems to new systems. A contractor, Oxford ArchDigital, will undertake develop- 
ment of the toolkit, with funding supplied by English Heritage and the National Trust. 
The toolkit will initially have the following technical components: 

5.1 The FISHXML Format and Data Validator 

This will be an Extensible Markup Language (XML) schema, based upon the MIDAS 
data standard. It will be designed using the CIDOC Conceptual Reference Model to 
ensure that the basic schema can be extended to meet the wider range of information 
envisaged by the forthcoming second edition of MIDAS. 

In addition, a separate tool will be developed to validate the content of FISHXML 
files. This will match terms used in the XML file exported from an HEIR with the 
terminology standards maintained within the INSCRIPTION framework, and report 
on possible problems. 
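
The validation step described above can be pictured with a minimal sketch. It is not the FISH validator itself: the element name `monument_type`, the sample terms and the function `validate_terms` are hypothetical, and a real tool would load its term lists from the INSCRIPTION standards rather than from a hard-coded set.

```python
import xml.etree.ElementTree as ET

# Hypothetical controlled vocabulary, keyed by element name.
# Real term lists would come from the INSCRIPTION framework.
ALLOWED_TERMS = {"monument_type": {"BARROW", "HILLFORT", "CHURCH"}}


def validate_terms(xml_text, allowed=ALLOWED_TERMS):
    """Return (element, value) pairs whose text is not in the controlled list."""
    problems = []
    root = ET.fromstring(xml_text)
    for element in root.iter():
        vocabulary = allowed.get(element.tag)
        if vocabulary is None:
            continue  # this element is not terminology-controlled in the sketch
        value = (element.text or "").strip().upper()
        if value not in vocabulary:
            problems.append((element.tag, value))
    return problems


# Illustrative export: the second record uses a non-standard spelling.
SAMPLE = """<records>
  <monument><monument_type>Hillfort</monument_type></monument>
  <monument><monument_type>Hill fort</monument_type></monument>
</records>"""

report = validate_terms(SAMPLE)
```

Reporting problems rather than rejecting the file outright matches the description of a tool that will "report on possible problems", leaving the decision to the HEIR manager.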

Together these will ensure the creation of standardised information resources, 
which can be passed from organisation to organisation, or system to system, or deposited
in digital archives (Fig. 2).



5.2 The FISH Historic Environment Exchange Protocol (HEEP)

This will be a protocol for the remote querying and retrieval of data from FISHXML 
compliant HEIRs. Protocol requests and results will be delivered in FISHXML. 

This is the most exciting and innovative tool in the FISH Interoperability Toolkit. 
The protocol will support machine-to-machine exchange of ‘live’ information be- 
tween HEIRs, as opposed to the manual approach of record copying, export and im- 
port of data. While the FISHXML schema dictates the structure used to exchange data 
between systems, the protocol provides compliant computers with the instructions for
secure and structured transmission and data exchange.

Fig. 2. The FISH Interoperability Toolkit sits between different HEIRs to assist with transfer of
information.



6 Progress and Prospects 

At the time of writing, development work is underway on all these components, with a
target date for completion of September 2004. Following on from that the Forum on 
Information Standards will maintain and develop the FISH Interoperability Toolkit on 
behalf of the historic environment sector. Future development may tackle other im- 
pediments to interoperability. One example is the problem of concordance between 
records derived from different HEIRs where it is important for one system not to 
include duplicate records for the same place. It is hoped that using a standard format
for the data, such as FISHXML, will support the development of software
tools to automate the comparison of different datasets to identify sites recorded in 
both. 
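
As an illustration only, the kind of concordance tool envisaged here might compare two exported datasets on a normalised site name and on position. The record fields (`id`, `name`, `easting`, `northing`), the 100 m tolerance and the sample records are assumptions made for this sketch; they are not part of FISHXML or MIDAS.

```python
from math import hypot


def normalise(name):
    """Lower-case a site name, drop punctuation and collapse spaces."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def find_shared_sites(dataset_a, dataset_b, tolerance_m=100.0):
    """Pair records whose names match and whose positions lie within tolerance_m."""
    matches = []
    for a in dataset_a:
        for b in dataset_b:
            same_name = normalise(a["name"]) == normalise(b["name"])
            distance = hypot(a["easting"] - b["easting"],
                             a["northing"] - b["northing"])
            if same_name and distance <= tolerance_m:
                matches.append((a["id"], b["id"]))
    return matches


# Hypothetical records from two HEIRs describing (perhaps) the same place.
HEIR_A = [{"id": "A1", "name": "St. Mary's Church",
           "easting": 375000, "northing": 164800}]
HEIR_B = [
    {"id": "B7", "name": "St Marys Church", "easting": 375040, "northing": 164830},
    {"id": "B8", "name": "St Marys Church", "easting": 390000, "northing": 170000},
]

shared = find_shared_sites(HEIR_A, HEIR_B)
```

Candidate pairs found in this way would still need human review, since two distinct sites can share a name and lie close together.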

Further developments will be discussed via the FISH discussion list and promoted 
via the FISH website www.fish-forum.info. All are welcome to participate.

Abstract. Information systems used in archaeology have several needs: interoperability
among heterogeneous systems, making information available without
significant delay, long-term preservation of data, and providing a suite of
services to users. In this paper, we show how digital library techniques can be 
employed to provide solutions to three of these problems. We show this by de- 
scribing a prototype for an archaeological Digital Library (ETANA-DL). First, 
ETANA-DL applies and extends the metadata harvesting approach to address 
some of the needs - interoperability, rapid access to data, and data preservation. 
Second, we show that availability of a pool of components that implement 
common DL services has helped in rapidly creating the prototype, which was 
subsequently used for requirements elicitation. However, understanding com- 
plex archaeological information systems is a difficult task. Third, therefore, we 
describe our efforts to model these systems using the 5S framework, and show 
how the partially developed model has been used to implement complex ser- 
vices helping users carry out key tasks with the integrated data. 



1 Introduction 

Archaeological research results in the production of vast quantities of heterogeneous 
information. Some of the kinds of digital objects include field records, GIS records, 
images, audio, video, 3D-models, and many more. Many projects in archaeology use 
custom information systems to store and process their information, generated both in
the field and in laboratories. However, while these systems are of great utility,
new problems arise because they are tailored to meet the needs of specific projects, 
are monolithic in nature, and more importantly use different schemes to store the 
information [5-7,15]. Thus, one of the three main problems we address in this paper 
is that achieving interoperability between these systems becomes very difficult. Ar- 
chaeological research greatly depends on typologies, comparisons, and existence of 
relationships with information from other projects/sites, things that are possible to 
accomplish in a real-time fashion only if these various systems are highly interoper- 
able. 



R. Heery and L. Lyon (Eds.): ECDL 2004, LNCS 3232, pp. 186-197, 2004. 
© Springer-Verlag Berlin Heidelberg 2004 



Prototyping Digital Libraries Handling Heterogeneous Data Sources 1 87 

A second problem is that primary data in archaeological research usually is avail- 
able to researchers outside a project/site only after substantial delay. What is desirable 
to speed up the transfer of knowledge is a highly efficient information system that 
would make the primary data available as soon as it is produced, and which can be 
used both in the field and out. A third problem is that many of the tailor-made ar- 
chaeological information systems do not provide a sustainable solution to long-term 
preservation and dissemination of information. Distributed and replicated existence of 
valuable information is necessary to ensure that the information is preserved for fu- 
ture use. 

Archaeological information systems that, in addition to storing and retrieving in- 
formation, provide services similar to those provided by modern Digital Libraries 
(DLs) would be highly desirable, affording solutions to the three problems described 
above. Having a single unifying system that is able to intelligently manage heteroge- 
neous information from several sites along with providing a rich array of services, 
including those specific to archaeology - GIS visualization systems, object comparisons,
complex workflow management, etc. - would greatly help archaeological
research [8].

Our approach to dealing with the problems presented above is to create a Digital 
Library for archaeology - ETANA-DL¹. ETANA-DL is a model-based, extensible,
componentized DL that manages complex information sources using the client-server 
paradigm of the Open Archives Initiative Protocol for Metadata Harvesting (OAI- 
PMH) [14]. 

In this paper, we demonstrate how the development of information systems (e.g., 
digital libraries) that address key needs, as described above, can be efficiently and 
effectively accomplished by applying and extending the Open Archives approach to 
metadata harvesting, and by building upon componentized frameworks like Open 
Digital Libraries [3,12,16]. We note that requirements elicitation is critical to the 
success of our DL, because the services that ETANA-DL will support depend on the 
requirements of the archaeologists. Thus, we use a prototyping approach to elicit 
requirements and describe the design and implementation of our initial Digital Li- 
brary as well as the various supported services. Our prototype is designed mainly 
from existing components, which saves development costs, if one has a good design 
based on an accurate model that reflects a deep understanding of the domain. How- 
ever, understanding complex information systems is a difficult task. Hence, we use 
the very powerful 5S framework in our modeling of ETANA-DL. We show how our 
partially developed model for archaeological information systems has been used to 
integrate heterogeneous data from disparate sources, and implement complex services 
over the integrated data [10]. 

The rest of this paper is organized as follows. Section 2 describes our efforts to 
model ETANA-DL using the 5S theory. Section 3 describes the architecture of the 
prototype, and describes our approach to address some of the needs discussed above. 
Section 4 gives an overview of the various services supported by the current 
ETANA-DL prototype. We provide an analysis of our approach in Section 5. Conclu- 
sions and future work are presented in Section 6. 



1 ETANA-DL home page: http://feathers.dlib.vt.edu



2 Modeling ETANA-DL

Most archaeological projects approach the handling of data and information in di- 
verse ways. In order to address the problems of managing such heterogeneous data 
and processing in archaeology, we apply 5S to model archaeological data and proce- 
dures. The 5S (Streams, Structures, Scenarios, Societies, Spaces) framework is a 
comprehensive modeling tool that allows a DL designer to describe most aspects of a 
DL. 5S has already been used to create meta-models for education-oriented digital 
libraries [11]. We are creating a unified (meta-)model for archaeological systems 
based on the 5S framework for information systems. Figure 1 represents our archaeo- 
logical meta-model graphically, where relationships crossing sub-models show points 
where concepts are logically contained in one of the ‘S’ models, but are composed or 
work together with concepts in other Ss to define DL constructs. 

In the archaeological setting, Streams represent the enormous amount of dynamic
multimedia information gathered and created by specialists. Examples include photos 
and drawings of excavation sites, loci, or unearthed artifacts, audio and video re- 
cordings of excavation activities, textual reports, and 3D models which are used to 
measure, reconstruct, and visualize archaeological ruins. 

Structures represent the ways archaeological information is organized along sev- 
eral dimensions. Examples include site organization, temporal organization, and tax- 
onomies of specific unearthed artifacts like bones and seeds. Particularly important is 
the structure of sites, since it defines the core units of knowledge in the archaeologi- 
cal DL. Generally, specific regions of archaeological interest are subdivided into 
sites, normally administered and excavated by different groups. Each site is further 
subdivided into partitions, sub-partitions and loci, the latter being the nucleus of the 
excavation. Material or artifacts found in different loci are organized in containers for 
further reference and analysis. 
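
The site structure just described is a simple containment hierarchy, which the following sketch makes concrete. The class and field names are ours, chosen for illustration; they are not taken from the ETANA-DL schema, and the example values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Container:
    label: str                      # e.g. a bag, pail or basket
    contents: List[str] = field(default_factory=list)


@dataclass
class Locus:
    name: str
    containers: List[Container] = field(default_factory=list)


@dataclass
class SubPartition:
    name: str
    loci: List[Locus] = field(default_factory=list)


@dataclass
class Partition:
    name: str
    sub_partitions: List[SubPartition] = field(default_factory=list)


@dataclass
class Site:
    name: str
    partitions: List[Partition] = field(default_factory=list)

    def paths(self):
        """Yield one 'site > partition > sub-partition > locus' path per locus."""
        for partition in self.partitions:
            for sub in partition.sub_partitions:
                for locus in sub.loci:
                    yield " > ".join(
                        [self.name, partition.name, sub.name, locus.name])


# Placeholder instance: one locus holding one container.
example = Site("Nimrin", [Partition("NW", [SubPartition("N35/W20", [
    Locus("72", [Container("bag 1", ["bone"])])])])])
```

Calling `list(example.paths())` gives the single path "Nimrin > NW > N35/W20 > 72", the kind of breadcrumb a browsing interface could display.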

Spaces model spatial and geographic distribution of found artifacts, as well as user 
interfaces, often employing metric or vector spaces, and are used to support retrieval 
operations, calculate distances, and constrain searches spatially. User classes defined 
in the Society model include archaeologists and the public who use DL services, the 
behavior of which is specified in the Scenario model. Besides Societies of users, 
service managers, which are electronic entities responsible for running services, also 
are specified in the Societies model. Scenarios and Societies act together to capture 
and model not only the services used by the public (search, browse, annotate, recom- 
mend), but also domain specific services for archaeological experts. More specifi- 
cally, we have identified four main general classes of DL services [9]: repository 
building, value added, domain specific, and information satisfaction. Components in 
the Space and Structure models also interact with each other. For example, coordinate 
systems and different taxonomies are used in metadata records to describe different 
parts of the site. 

This generic meta-model can be instantiated to create specific models of archaeo- 
logical DLs. One example is the ETANA-DL model shown in Figure 2, which uses 
the meta-model to try to unify the models of several archaeological systems.

Fig. 1. Archaeological DL meta-model




Fig. 2. ETANA-DL model 



Current Streams of content that are present in ETANA-DL include figurine im- 
ages, drawings and photos of (part of) the sites, and preliminary (final) reports. The 
core of ETANA-DL is its Union catalog which merges metadata harvested initially 
from three sites: Umayri [2], Nimrin [4], and Halif [1]. Each site has its own organization
that can be mapped to portions of the meta-model. Partitions are large excavation
units. The names of the partitions vary from project to project. They represent
quadrants for Nimrin, but designate fields for Umayri and Halif. Sub-partitions are 
smaller units, which are typically a square, and within squares numerous loci are 
identified. A locus can be anything that is identifiable and distinguishable from its 
surroundings. They are typically the smallest excavation unit. Excavated materials 
and items (bones, seeds, and figurines) are collected in containers for preservation 
and analysis. Those containers can be bags (Nimrin), pails (Umayri), or baskets 
(Halif). Each site also has its own archaeological periods (chronology). The earliest
occupation in Nimrin is represented by stone and mud-brick walls from the Middle
Bronze I (MBI) period (c. 2000 B.C.E.); Halif has occupation history from chalco- 
lithic through Modern Arab; and the chronology of Umayri is from Paleolithic 
through modern time. In ETANA-DL, each site mentioned above has its site-specific 
coordinate system, e.g., Nimrin used a Polar system. The ETANA-DL user interface 
is web-based. The ETANA-DL services are described in Section 4.

Figure 3 illustrates the modeling of the unifying schema for the prototype. In the 
current prototype, ETANA-DL handles three kinds of digital artifacts: bones, seeds, 
and figurines. The digital records of each of these kinds of artifacts have attributes 
that are specific for that artifact type. However, many of the artifacts share common 
attributes. For example, every digital object in the prototype has spatial attributes 
associated with it. Therefore, these attributes are associated with a base class object. 
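
A minimal sketch of this design choice, under the assumption that the shared spatial attributes are the site, partition, sub-partition and locus of an object, is given here. The attribute names are hypothetical; the actual unifying schema is an XML schema and is not reproduced in this paper.

```python
from dataclasses import dataclass


@dataclass
class ArchaeologicalObject:
    """Attributes shared by every digital object (hypothetical names)."""
    identifier: str
    site: str
    partition: str
    sub_partition: str
    locus: str


@dataclass
class Bone(ArchaeologicalObject):
    bone_name: str = "UNIDENTIFIED"   # artifact-specific attribute


@dataclass
class Seed(ArchaeologicalObject):
    species: str = ""                 # artifact-specific attribute


@dataclass
class Figurine(ArchaeologicalObject):
    description: str = ""             # artifact-specific attribute


def location(obj: ArchaeologicalObject) -> str:
    """Works for any artifact type because it only uses base-class fields."""
    return "/".join([obj.site, obj.partition, obj.sub_partition, obj.locus])


femur = Bone("b1", "Nimrin", "NW", "N35/W20", "72", bone_name="FEMUR")
```

Services that only need spatial information, such as browsing by site structure, can then be written once against the base class.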




Fig. 3. ETANA-DL unifying XML schema - a design overview 



3 ETANA-DL Architecture 

We apply and extend the metadata harvesting approach of the OAI to address some of 
the needs of information systems in archaeology: heterogeneous data handling and 
interoperability, making primary data available, and long-term preservation. Figure 4 
illustrates the architecture of the current prototype. 

We convert partner archaeology sites into Open Archives by implementing data 
providers at the respective sites. We expose the metadata at each of these sites using a 
custom, unifying, metadata format that we have developed for the prototype, and one 
that will keep evolving as we ingest newer kinds of data from different sites [17]. 
This is a challenging task because of the custom schemas each of these sites use for 
storing their information. 

As shown in Figure 4, we have implemented semi-automatic data-mapping com- 
ponents that convert the data from its local view known to the system at a local site, 
to a global view known to ETANA-DL, and described in Section 2. By doing this, we 
shift the complexity of data mapping from the service provider to the data provider. If the
schemas for representing the data change at a local site, the system remains unaffected,
as only the data-mapping components at that local site need to be reconfigured.
Moreover, sites may not want to expose sensitive data. Such filtering is a
part of the data-mapping layer, requiring the sites to customize only one layer of the 
system. 
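
To make the role of the data-mapping layer concrete, the following sketch converts one local record to a global view, renaming fields, dropping sensitive ones, and adding collection-level information. All field names and the mapping table are invented for illustration; the real components are semi-automatic and site-specific.

```python
# Hypothetical mapping from one site's local column names to global names.
FIELD_MAP = {"sq": "sub_partition", "loc": "locus", "bone": "bone_name"}

# Local fields that this site chooses not to expose.
SENSITIVE = {"finder_notes"}

# Collection-level information added to every exported record.
COLLECTION_INFO = {"site": "Nimrin", "object_type": "bone"}


def to_global_view(local_record, field_map=FIELD_MAP,
                   sensitive=SENSITIVE, extra=COLLECTION_INFO):
    """Translate a local record into the global view, filtering as we go."""
    global_record = dict(extra)
    for local_name, value in local_record.items():
        if local_name in sensitive:
            continue  # filtering happens in the same layer as mapping
        global_record[field_map.get(local_name, local_name)] = value
    return global_record


local = {"sq": "N35/W20", "loc": "72", "bone": "FEMUR", "finder_notes": "private"}
exported = to_global_view(local)
```

If the local schema changes, only `FIELD_MAP` and `SENSITIVE` need to be revised in this sketch; nothing downstream of the exported record is affected.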




Fig. 4. ETANA-DL Prototype Architecture 



Whenever a record is added or updated at the source, the time-stamp associated 
with the record changes. Service providers harvest all records from a data provider 
whose time-stamps have been updated since the archive was last harvested. Running 
the harvester at the service provider on a regular basis addresses the issue of the data 
changing at the source, and these changes being visible to the users of the DL. By 
implementing harvesters that can be configured to harvest data from the sites at fre- 
quencies proportional to the rate at which data changes at the respective sites, our 
approach makes primary archaeological data available without significant delay.
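
The incremental harvest described above corresponds to selective harvesting in OAI-PMH, where a `ListRecords` request carries a `from` datestamp. The sketch below builds such a request and, separately, shows the equivalent filter on in-memory records. The base URL and the metadata prefix `etana_unified` are placeholders, and seconds-level datestamps are assumed to be supported by the data provider.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def list_records_url(base_url, last_harvest, metadata_prefix):
    """Build an OAI-PMH ListRecords request for records changed since last_harvest."""
    query = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": last_harvest.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return base_url + "?" + urlencode(query)


def changed_since(records, last_harvest):
    """Select the records a data provider would return for that request."""
    return [r for r in records if r["datestamp"] > last_harvest]


last_run = datetime(2004, 6, 1, tzinfo=timezone.utc)
url = list_records_url("http://example.org/oai", last_run, "etana_unified")

records = [
    {"id": "r1", "datestamp": datetime(2004, 5, 20, tzinfo=timezone.utc)},
    {"id": "r2", "datestamp": datetime(2004, 6, 3, tzinfo=timezone.utc)},
]
to_harvest = changed_since(records, last_run)
```

A scheduler could call such a function at a different interval for each site, in line with harvesting frequencies proportional to the rate of change at each site.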

The data exposed by each site in the common format is harvested into a Union 
Catalog at the service provider (ETANA-DL) on a regular basis. We index the har- 
vested data in two formats: as inverted files to provide IR-like services, and as rela- 
tional databases to provide DB-like services. The search engine component uses the 
inverted files to provide search services, and the browse engine uses the relational 
databases to provide browse services that allow a user to navigate the various kinds of 
data in ETANA-DL. Other ETANA-DL services rely on the relational databases 
containing indexed archaeological data or custom databases to provide their function- 
ality. 
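
The two index formats can be illustrated with a small sketch: an inverted file mapping terms to record identifiers for search, and a relational table for browsing. The record fields and table layout are assumptions for this example, with an in-memory SQLite database standing in for the relational store.

```python
import sqlite3
from collections import defaultdict

# Illustrative harvested records (field names are hypothetical).
RECORDS = [
    {"id": "r1", "site": "Nimrin", "kind": "bone", "text": "femur fragment"},
    {"id": "r2", "site": "Umayri", "kind": "bone", "text": "rib fragment"},
    {"id": "r3", "site": "Halif", "kind": "figurine", "text": "female figurine head"},
]


def build_inverted_index(records):
    """Map each term to the set of record ids containing it (IR-like access)."""
    index = defaultdict(set)
    for record in records:
        for term in record["text"].lower().split():
            index[term].add(record["id"])
    return index


def build_browse_db(records):
    """Load the same records into a relational table (DB-like access)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE objects (id TEXT PRIMARY KEY, site TEXT, kind TEXT)")
    db.executemany("INSERT INTO objects VALUES (?, ?, ?)",
                   [(r["id"], r["site"], r["kind"]) for r in records])
    return db


index = build_inverted_index(RECORDS)
db = build_browse_db(RECORDS)
bone_sites = [row[0] for row in db.execute(
    "SELECT site FROM objects WHERE kind = ? ORDER BY site", ("bone",))]
```

Keeping both structures costs extra storage, but lets keyword search and structured navigation each use the representation that suits it.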

The web interface serves as the glue that binds the different services, some of them 
implemented as ODL components. All of the ODL components have been reused 
with little or no modification. These components communicate with each other, and 
the web interface using the XOAI protocol. XOAI is an extension to the OAI-PMH 
and provides the basis for developing inter-component communication protocols [16].
Other components are directly invoked by the web interface.

Replicating service providers (by creating mirror sites) that harvest data from the
various Open Archives ensures replicated existence of data. Thus, by coupling our 
componentized approach to creating DLs with the metadata harvesting approach of 
the OAI, we provide a sustainable and easily maintainable approach that addresses 
long-term preservation of valuable archaeological information. 



4 ETANA-DL Services 

The ETANA-DL prototype supports several services. We describe the cross- 
collection browsing service in brief to demonstrate how the 5S model for ETANA- 
DL has been used to implement services over the integrated data. Table 1 provides an 
overview of the services, and their classification [9]. We also indicate which of the 
services have been implemented by re-using components in the current prototype. For 
more information on the current ETANA-DL services, refer to [13]. 

In the current prototype, there are three main dimensions for browsing the inte- 
grated data - by the structural organization of a site, by a time period taxonomy, and 
by taxonomies specific to the type of artifact. Browsing by the structural organization 
of a site is based on the 5S structural model described in Section 2. For the other 
dimensions, we have designed taxonomies for browsing, and map the harvested data 
to our generic taxonomies. 

The dynamic nature of the browsing system allows a user to see only those catego- 
ries for a dimension for which digital objects exist. The categories are chosen by 
querying the browse-index database at run-time, thereby allowing a user to freely 
move along any dimension, or a combination thereof. In addition, a user can search 
within a browsing context for information. This can be thought of as a way to restrict 
the search space using the information associated with a context. Figure 5 shows a 
sample interface for the browsing service. In this example, the user is browsing along 
all three dimensions. The interface shows the current context of the user, and allows 
the user to return to any of the previous contexts with a single click. 
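
One way to picture the run-time category lookup is the sketch below, which returns, for a chosen dimension, only the values that occur among objects matching the current browsing context. The table `browse` and its three columns are a simplified stand-in for the browse-index database, not its actual layout.

```python
import sqlite3

DIMENSIONS = ("site", "period", "bone_name")  # hypothetical browse-index columns


def open_browse_index(rows):
    """Create an in-memory browse index from (site, period, bone_name) rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE browse (site TEXT, period TEXT, bone_name TEXT)")
    db.executemany("INSERT INTO browse VALUES (?, ?, ?)", rows)
    return db


def categories(db, dimension, context):
    """List values of `dimension` for which objects exist in `context`."""
    # Column names are checked against a whitelist before being placed in SQL.
    if dimension not in DIMENSIONS or any(k not in DIMENSIONS for k in context):
        raise ValueError("unknown dimension")
    where = " AND ".join(f"{column} = ?" for column in context) or "1 = 1"
    sql = (f"SELECT DISTINCT {dimension} FROM browse "
           f"WHERE {where} ORDER BY {dimension}")
    return [row[0] for row in db.execute(sql, tuple(context.values()))]


ROWS = [
    ("Nimrin", "IRON I", "FEMUR"),
    ("Nimrin", "IRON I", "RIB"),
    ("Nimrin", "MB I", "RIB"),
    ("Umayri", "IRON I", "ANTLER"),
]
browse_db = open_browse_index(ROWS)
```

Because the context is just a set of column constraints, a "search within this context" feature could reuse the same constraints to narrow the search space.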



[Figure 5: screenshot of the ‘Browse ETANA-DL's collections’ page. Callouts mark the
current user context (You are in: Main > Nimrin > Bone > IRON I > NW > N35/W20),
a ‘Search within this context’ box, and the user's location within three dimensions:
the site structural dimension (browse within sub-partition by locus: Nimrin > NW > N35/W20),
the temporal dimension (browse within period: IRON I, 1000-900 BC), and the
artifact-specific taxonomies (browse by name of bone: ANTLER, FEMUR, INNOMINATE,
LONG BONE, MAXILLA, METAPODIAL, RIB, UNGULATE TOOTH, UNIDENTIFIED, VERTEBRA).]

Fig. 5. Dynamic multi-dimensional browsing in ETANA-DL

Table 1. Overview of services supported by the current ETANA-DL prototype

Service                        Description
-----------------------------  ---------------------------------------------------------

Information satisfaction services

Searching                      Allows users to search for specific information in the DL.
(Component-based)              Users can use the advanced search option to formulate
                               queries that are more complex.

Browsing                       Allows users to discover new information in the DL.
                               Users can browse along many dimensions, and categories
                               for browsing are generated dynamically based on the DL
                               content. Users also can search within a context, save
                               browsing contexts, etc.

Recommendation                 Recommends digital objects in the DL that the user is not
(Component-based)              aware of, based on similarity of interest with other users.

Domain-specific services

Object Comparison              Allows users to perform comparisons between different
                               digital objects and view the results. Users specify the
                               various parameters that form the basis for comparison.

Marking items                  Allows users to direct specific digital objects to other
                               users of the system. Users can include annotations that
                               are only visible to specific other users.

Value-added services

Annotations                    Allows users to discuss the various digital objects in the
(Component-based)              system. Users can post messages, and other users can
                               respond to the posted messages.

Recent searches/discussions    Allows users to view their most recent searches and
                               recent on-going discussions.

Items of Interest              Binding service that allows users to create personal
                               collections out of items in the DL that interest them.

Miscellaneous services

User management                User registration, system login, and other user
                               management functions.

Collections description        Allows users to view detailed information about various
                               collections in ETANA-DL.



5 Analysis 

In this section, we provide an analysis for our approach to creating DLs that handle 
heterogeneous archaeological data. We demonstrate that, given a pool of components 
that implement common DL services, a prototype that supports useful services can be 
rapidly generated.

Heterogeneous Data Handling

The current ETANA-DL prototype harvests data from three different sites - Nimrin, 
Halif, and Umayri. We have converted these sites into Open Archives that partially 
expose their data, and have harvested records of different kinds of digital objects 
from these sites to prove the extensibility and scalability of our approach to handling 
heterogeneous archaeological data. Table 2 provides an overview of the data con- 
tained in the ETANA-DL Union Catalog. 



Table 2. Heterogeneous data in ETANA-DL - an overview

Site    Artifact type       Original data source       Attributes in    Attributes in     Records
                                                       original record  harvested record  harvested
------  ------------------  -------------------------  ---------------  ----------------  ---------
Halif   Figurine            Tab-delimited text file    15               18                564
Nimrin  Bone field record   Table in relational DB     21               24                7420
Nimrin  Seed field record   Table in relational DB     12               15                430
Umayri  Bone field records  2 tables in relational DB  8                24                2123



In the current DL prototype, we have harvested bone records from two sites (Nim- 
rin and Umayri) to show the heterogeneity of our approach (being able to handle data 
from disparate sources). More than 10,000 digital records have been harvested from 
the three sites. The increase in the number of attributes for each object type in its 
global view is due to attributes associated with information about the collection, ob- 
ject type, etc. being added to the metadata associated with each record. 

Figure 6 shows a breakdown of the times required during various stages of con- 
verting an archaeological site into an Open Archive. It is evident that the majority of 
time required is in analyzing data (e.g., discovering relationships that exist in the 
schema at a local site) and mapping it to our unifying schema (more than 90%). Data 
and service provider implementation times include the time to implement the data 
provider, test the Open Archive, and harvest data into our Union Catalog, and are 
only a fraction of the time to analyze and map the data (less than 10%) due to the 
availability of easily configurable components. 




[Figure 6: chart with legend entries Data Analysis, Data Mapping, Data Provider
Implementation, and Service Provider Implementation.]



Fig. 6. Breakdown of times required during the various stages of conversion of a site into an 
Open Archive for the current prototype

Table 3. Analysis of prototype using the metric of Lines of Code

Type of service    LOC for implementing  LOC reused from  Total LOC  Reuse percentage
                   service               component
-----------------  --------------------  ---------------  ---------  ----------------
Componentized      350                   3630             3980       91
Non-componentized  7950                  -                7950       -
Total              8300                  3630             11930      30.4



Rapid Prototyping 

We used two software metrics to analyze the rapidity of our prototyping efforts: 
Lines of Code required for implementing services, and service development times. 

Table 3 shows the Lines of Code (LOC) needed to implement componentized and 
non-componentized services. The final column in the table is the percentage savings
in LOC gained from component re-use. Re-use accounts for 91% of the code in the
componentized services and for approximately 30% of the DL code overall, a very
significant saving obtained by designing common DL services as components.
Moreover, for creating prototypes rapidly where quality of the service is
as important as the speed with which services can be put together and modified, the
approach to building DLs using pre-existing components is very useful. The components
that we have developed for the prototype can be re-used in other DLs, thus
resulting in an even higher re-use percentage value.
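The percentages in Table 3 follow directly from the LOC counts. The short sketch below reproduces the arithmetic; the variable and function names are illustrative and do not come from the prototype.

```python
# Figures copied from Table 3 (lines of code).
componentized_new = 350       # LOC written to implement componentized services
componentized_reused = 3630   # LOC reused from existing components
non_componentized_new = 7950  # LOC written for non-componentized services


def reuse_percentage(reused: int, total: int) -> float:
    """Share of a code base (in percent) that came from reused components."""
    return 100.0 * reused / total


componentized_total = componentized_new + componentized_reused
overall_total = componentized_total + non_componentized_new

componentized_reuse = reuse_percentage(componentized_reused, componentized_total)
overall_reuse = reuse_percentage(componentized_reused, overall_total)
print(f"componentized: {componentized_reuse:.1f}%  overall: {overall_reuse:.1f}%")
```

The two values correspond to the 91 and 30.4 entries in the final column of the table.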

Prototyping a service involves three stages: requirements analysis and design,
implementation, and testing. Figure 7 shows a comparison of effort (measured in
development time) for the various stages of the prototyping cycle. Chart A shows the
percentages of time required for the component-based services whereas Chart B
shows the same for non-componentized services. The comparison shows that
a larger share of effort can be spent on analysis, design, and testing for component-based
services than for non-componentized services. Thus, a DL implementer can save
a significant percentage of implementation time by re-using components that
implement common DL services.




Fig. 7. Service development time percentages for various stages of prototyping: (A) componen- 
tized services, and (B) non-componentized services 



User Analysis 

Our prototype was evaluated by some members of the archaeology community; the
results can only be summarized here because of space restrictions [13]. All
the services provided in the current ETANA-DL prototype were found to be useful.
Some services, such as advanced search and object comparison, and some features
of the multi-dimensional browsing service, were found to be less intuitive and
difficult to use. Nevertheless, the organization of the browsing service around a generic
site structure and common taxonomies (as described in Section 2) was seen as a plus.
Moreover, we have been able to elicit many requirements for improving the current
services using our rapidly generated prototype. Addressing these requirements should
increase the utility of the system. Our approach to handling heterogeneous archaeological
data is further validated by the positive feedback for the cross-collection searching and
browsing services supported by the current prototype.



6 Conclusions and Future Work 

This paper has described our experiences in creating a prototype for a Digital Library 
for handling heterogeneous archaeological data. We have demonstrated that we can 
use digital library techniques to address some of the long-standing needs of archaeological
information systems, and have demonstrated a scalable and easily manageable
approach for handling heterogeneous archaeological data from disparate sources.
We have shown that given a pool of DL components that implement common DL 
services, a prototype for a DL providing useful services can be rapidly implemented, 
which can then be used to better understand requirements of users from the DL. The 
5S framework has helped us in understanding complex archaeological information 
systems, and the resulting partially developed 5S archaeological model has been used 
to integrate heterogeneous data and guide useful services. 

Future work on the prototype includes creating next generation DL services that
address the requirements that we have gathered using the prototype, integrating richer 
content into the DL by extending the unifying metadata schema to cover more 
sites/artifact types, enhancing information access services (e.g., searching, browsing) 
to allow archaeologists to retrieve information easily, and extensive usability studies.

Abstract. This paper describes the provision of a Web Services interface
that will extend the possibility of machine-to-machine access to the
zetoc current awareness service, within the JISC Information Environment
and eScience applications. This bespoke interface includes open
standard XML metadata for searches and responses where possible. Elements
from the OpenURL XML metadata formats for journals and
books are used to transmit the bibliographic citation information that
is an integral part of a zetoc record for a journal article or conference
paper.

Keywords: Web Services, bibliographic citation, metadata, OpenURL, 
Dublin Core, SRW. 



1 Introduction 

zetoc [1] [2] is a current awareness and document delivery service based on the 
British Library’s [3] Electronic Table of Contents of journal articles and confer- 
ence papers. Hosted at MIMAS [4], zetoc is available to researchers, teachers 
and learners in UK Higher and Further Education under a ‘strategic alliance’ [5],
and to practitioners within the UK National Health Service. The zetoc
database, updated daily, contains details of articles from approximately 20,000 
current journals and 16,000 conference proceedings published per year. With over 
20 million article and conference paper records from 1993 to date, the database 
covers every imaginable subject in science, technology, medicine, business, law, 
finance and the humanities. Human users can search zetoc through its Web 
interface to retrieve articles by one of the document delivery options. They may 
also use the popular email alert service to maintain current awareness of new 
articles of possible interest. Machine-to-machine searching is available by Z39.50 [6],
the NISO (North American National Information Standards Organization)
standard for information retrieval that provides a protocol for two computers to 
communicate and share information. It is also enabled by OpenURL [7], another 
standard way of passing information between machines, zetoc being enabled as 
an OpenURL ‘link-to’ resolver. 

A new Web Services SOAP [8] (the World Wide Web Consortium’s server-to-server
protocol for object retrieval) interface to zetoc has been developed as
part of the A2Z (Akenti access to zetoc) project [9], the main purpose of which
was to investigate digital certificate authentication, in particular using Akenti [10],
within eScience applications as well as within the Information Environment [11]
under development by the JISC (the Joint Information Systems Committee
of the UK Higher and Further Education Funding Councils). Because eScience
projects are generally outside the digital library domain, experimenting via the
zetoc Z39.50 interface did not seem appropriate, whereas a Web Services interface
would be suitable for use in areas such as workflow modelling of composite
services within eScience projects such as myGrid [12]. Within the digital library
based JISC Information Environment, portals and virtual learning development
projects are beginning to use Web Services for machine-to-machine communication.

2 A SOAP Interface for zetoc 

A Web Services interface deals with messages that are sets of XML elements 
wrapped within SOAP envelopes. The requesting and responding servers, both 
understanding the SOAP protocol, are able to extract the XML data from the 
messages sent to them and to package their responding XML results accordingly. 
SOAP messages are passed between machines by RPC (Remote Procedure Call).
The zetoc SOAP interface is implemented by RPC over the Web Common 
Gateway Interface (CGI), and thus its address appears like a URL. 
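The wrapping and unwrapping of XML data in a SOAP envelope can be illustrated with a minimal sketch. It assumes the SOAP 1.1 envelope namespace; the payload namespace and element names (`urn:example:search`, `searchRequest`, `term`) are hypothetical placeholders and are not the zetoc schema.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
EXAMPLE_NS = "urn:example:search"  # hypothetical payload namespace, illustration only


def wrap_in_envelope(payload: ET.Element) -> ET.Element:
    """Place an XML payload inside a SOAP Envelope/Body pair."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    body.append(payload)
    return envelope


def unwrap_envelope(envelope: ET.Element) -> ET.Element:
    """Return the first payload element found in the SOAP Body."""
    body = envelope.find(f"{{{SOAP_ENV}}}Body")
    if body is None or len(body) == 0:
        raise ValueError("SOAP envelope has no body payload")
    return body[0]


# Sender side: build a payload and serialise the whole message.
request = ET.Element(f"{{{EXAMPLE_NS}}}searchRequest")
ET.SubElement(request, f"{{{EXAMPLE_NS}}}term").text = "materialia"
message = ET.tostring(wrap_in_envelope(request), encoding="unicode")

# Receiver side: parse the message and extract the payload again.
received = unwrap_envelope(ET.fromstring(message))
print(received.tag, received[0].text)
```

In practice a SOAP toolkit generates and parses these envelopes, so application code normally sees only the payload.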

Provision of a Web Services interface is in two parts. Firstly a ‘search request’ 
is needed to submit search terms for discovery to the application. Secondly a 
‘search response’ will return details of the results. To design the zetoc SOAP 
interface there appeared to be two options: use a standard or generally accepted 
schema; or develop a bespoke interface, that is an interface designed specifically 
for the particular service. 

SRW (Search - Retrieve - Web) [13] is a specification for general search re- 
quest and response developed under the auspices of the Z39.50 community, with 
the possibility of becoming a NISO standard for meta-searching [14]. SRW emu- 
lates Z39.50 by including various fields to return the number of search hits and 
to request the start position within the result set of the returned records. For 
the actual search SRW provides a Common Query Language (CQL) to enable 
Z39.50-like and interoperable search requests. The expected returned response
for each record within the result set is simple Dublin Core [15]. 

In fact the zetoc SOAP interface developed as part of the A2Z project is 
bespoke, although based on open standard metadata schemes where possible. It 
seems that SRW is ideal for distributed searching within a wide domain such as 
the JISC Information Environment because it allows common search requests 
to be sent to a range of services. However SRW seems less appropriate for mak- 
ing a connection to a single service whose capability and specific domain is well 
understood. Similarly returning simple Dublin Core records provides clear in- 
teroperability for distributed searching. But a simple Dublin Core description 
for a result that is a bibliographic record would lose richness and significant de- 
tail, in particular the bibliographic citation information for a journal article or 
conference paper.

The XML elements that make up the various requests and responses of the
zetoc SOAP interface are defined formally [16] as a Dublin Core Application 
Profile [17]. An application profile was a useful way to document all of these 
properties and their corresponding namespaces. However strictly a Dublin Core 
application profile is a flat structure, so some distortion of the application pro- 
file has been made to specify the hierarchical structure of the returned result 
set. This application profile effectively defines the zetoc-specific terms within a 
zetoc namespace. 



3 Metadata for zetoc SOAP 

Having decided to implement a bespoke SOAP interface for zetoc, it was desir- 
able to use metadata properties from open standards wherever possible. 



3.1 zetoc Search Request 

The available zetoc SOAP search requests replicate the searches available on 
the zetoc Web interface. Thus three search requests are provided: general that 
searches over all the data; journal article; and conference paper. The search 
fields include the obvious possibilities such as ‘all fields’, article title, author, 
publication year, and ISSN, to specify a journal, or ISBN, a book identifier to 
specify a conference proceedings. The journal and conference searches include 
more specific fields related to those genres. To support the retrieval of large result
sets in manageable chunks the zetoc SOAP search requests also need to indicate 
the position within the result set of the first record to be returned. Currently 
no Boolean operators are available. As in the zetoc Web interface, when several 
search terms are provided the implicit Boolean operator is ‘and’. 

There is an additional, fourth, ‘identifier’ request that returns a single zetoc 
full record corresponding to a specific zetoc identifier. 
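The implicit ‘and’ between search terms can be sketched as follows. This is an illustration of the matching rule only, not of how the underlying database evaluates a search; the tokenisation and the sample records are assumptions made for the example.

```python
import re


def keywords(text: str) -> set:
    """Lower-cased word tokens of a field value (assumed tokenisation)."""
    return set(re.findall(r"\w+", text.lower()))


def matches(record: dict, terms: dict) -> bool:
    """True only if every search term is a keyword of its field (implicit 'and')."""
    return all(
        keywords(term) <= keywords(record.get(field, ""))
        for field, term in terms.items()
    )


RECORDS = [  # hypothetical sample records
    {"dc:creator": "Apps, P. J.", "oujnl:jtitle": "SCRIPTA MATERIALIA"},
    {"dc:creator": "Apps, P. J.", "oujnl:jtitle": "JOURNAL OF EXAMPLES"},
]

both_terms = [r for r in RECORDS
              if matches(r, {"dc:creator": "apps", "oujnl:jtitle": "scripta"})]
one_term = [r for r in RECORDS if matches(r, {"dc:creator": "apps"})]
print(len(both_terms), len(one_term))
```

Adding a term can only narrow the result, which is the behaviour expected of an implicit ‘and’.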



3.2 zetoc Response 

The three search requests result in a search response that is a list of brief de- 
scriptions of zetoc records matching the search. The ‘identifier’ request results 
in a single, full zetoc record. 

To avoid returning unmanageably large result sets, the zetoc search response 
is a list of a fixed number (25) of brief records. Thus the response must include 
the total number of hits and the number of the next record in the result set 
following those returned. Along with the ‘first record position’ requested, this 
data enables repeated requests to obtain the full result set. An indication of the 
search performed is also returned. 
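A client could use these fields to page through a complete result set, as in the sketch below. The `fake_search` function stands in for a real SOAP call and is hypothetical, as is the convention of returning `None` as the next position once the last record has been delivered.

```python
PAGE_SIZE = 25  # fixed number of brief records per response, as stated above


def fake_search(start_record: int, total_hits: int = 60) -> dict:
    """Hypothetical stand-in for a SOAP search call returning the paging fields."""
    last = min(start_record + PAGE_SIZE - 1, total_hits)
    records = [{"recordPosition": pos} for pos in range(start_record, last + 1)]
    # Assumption: no next position is reported after the final page.
    next_position = last + 1 if last < total_hits else None
    return {
        "numberOfRecords": total_hits,
        "nextRecordPosition": next_position,
        "records": records,
    }


def fetch_all(search=fake_search) -> list:
    """Repeat the request, moving the start position, until nothing is left."""
    results, start = [], 1
    while start is not None:
        response = search(start)
        results.extend(response["records"])
        start = response["nextRecordPosition"]
    return results


all_records = fetch_all()
print(len(all_records))
```

With 60 hits the loop issues three requests, starting at positions 1, 26 and 51.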

The brief records returned correspond to the zetoc brief records available
from a Web search, including the position of the record within the entire result
set, but with the addition of the zetoc identifier. Returning the zetoc identifier
with the brief record enables a subsequent ‘identifier’ request to retrieve the full
details of an item of particular interest.

An ‘identifier’ response results in a single, full zetoc record that corresponds 
to a full zetoc record available from the Web interface. It includes all details 
about the article or paper available from the database. 

3.3 Dublin Core Properties 

For maximum interoperability properties are taken from Dublin Core where
possible. Thus from the simple Dublin Core namespace (‘dc’) the following
terms represent: dc:title, the article or paper title; dc:creator, the authors; and
dc:identifier, the identifier of the resource within the zetoc database, currently
an identifier local to the zetoc service. In a search request ‘title’ and ‘creator’
could contain keywords from the field rather than the entire value. In addition
dc:subject (for conference keywords), dc:contributor (for editors), dc:publisher,
dc:language, dc:format and dc:type are used to return some detailed information
in an ‘identifier’ full record response.

From the wider Dublin Core namespace (‘dcterms’) the following properties
represent: dcterms:issued, the publication year of an article; and
dcterms:bibliographicCitation, the citation details in a brief record of a search
response.

3.4 SRW and Z39.50 Bath Profile Properties 

The SRW namespace includes obvious properties to implement the retrieval of
large result sets in manageable pieces. Thus from the SRW namespace (‘srw’) are
taken: srw:numberOfRecords, the total number of search hits; srw:startRecord,
the requested start position; srw:nextRecordPosition, the number of the record
following those returned; and srw:recordPosition, the number within the result
set of each brief record returned.

The Bath Profile [18] is a derivation of Z39.50 for digital library applications,
defining search request attributes. From this namespace (‘bath’) these search
request terms are taken: bath:any, an ‘all fields’ search; and bath:conferenceName,
the conference details in a conference paper search.

3.5 OpenURL Properties 

Because zetoc is a citation database providing bibliographic information to en- 
able article requests, it is essential that the zetoc SOAP interface includes bib- 
liographic details in its search requests and responses. It was preferred that 
this bibliographic information be passed using open standard properties where 
possible. Dublin Core provides ‘dcterms:bibliographicCitation’, which is used to 
return the information as a string value within a brief record response. But 
Dublin Core does not provide bibliographic properties at any finer granularity. 

OpenURL was developed as a standard way of passing information about a
resource between a source application and an OpenURL-aware resolver [19]. Its
original and primary purpose is to enable a researcher to link from a referenced
article to a full text copy of that article where the researcher’s institution has
a valid subscription to read the article. During the process of proposing the
OpenURL Framework as a NISO standard, Z39.88-2004 [20], other possible uses
of OpenURL were envisaged including server-to-server communication. Thus an
XML schema for the OpenURL ‘payload’ (the ContextObject) was developed.
The OpenURL Framework is extensible by means of a Registry [21]. The initial
content of the OpenURL Registry, and hence the standard, includes metadata
formats as XML schema for journals and books.

The OpenURL journal [22] (‘oujnl’) and book [23] (‘oubook’) metadata formats
are used to capture the bibliographic citation properties within zetoc
SOAP. Thus from the OpenURL journal metadata format are taken: oujnl:jtitle,
the journal title; oujnl:issn, the journal ISSN; oujnl:volume, oujnl:issue and
oujnl:spage, the volume and issue number and start page of an article within a
journal search request; and oujnl:pages, the page range of an article or paper in
a full record response. Similarly from the OpenURL book metadata format are
taken: oubook:isbn, the ISBN of a conference proceedings; and oubook:spage,
the start page within a conference paper search.

3.6 zetoc Properties 

Although open standards are used as far as possible it was necessary to include
several zetoc-specific properties within a zetoc namespace. This namespace
includes all the containing XML elements of zetoc SOAP comprising the search
and ‘identifier’ requests and responses, and the brief record and its containing
array. The only zetoc term within the search requests is a field ‘ISSN or ISBN’
included in the general search. The only zetoc term in a search response holds
a string value representing the search performed on the zetoc database.

Inevitably the full record ‘identifier’ response includes several zetoc-specific
terms, for example the British Library’s ‘shelfmark’ and ‘location’ information,
and the frequency of publication for some journals. A ‘zetoc:type’ property
indicates whether a returned record is for a journal article or a conference paper.

Subject terms are available in zetoc as Dewey and Library of Congress
Classification. These are returned as properties ‘dewey’ and ‘lccn’ in the
zetoc namespace. Ideally they would be returned as ‘dc:subject’ with an XML
attribute ‘xsi:type’ of ‘dcterms:DDC’ or ‘dcterms:LCC’, something that may be
implemented in future versions of zetoc SOAP.

Within the zetoc database journal volume and issue information is run together
into a single field, reflecting the data supply from the British Library,
thus necessitating another zetoc-specific property ‘volissue’. A zetoc-specific
term is used to return any journal issue title, for example the name of a special
issue, recorded in zetoc, this being outside the scope of general journal metadata
formats.

There did not appear to be an existing open standard metadata scheme to
describe conference details. It would be possible to record the proceedings as a
book title using the OpenURL book metadata format, but this would not necessarily
capture all the data about the conference such as its venue and date.
Thus a term within the zetoc namespace is used to return all the conference
details concatenated into a string value. Another zetoc field returns the conference
sponsors.

3.7 Examples 

Journal Search. Some possible fields in a journal search request may be as in 
Table 1. 



Table 1. Example journal search terms.

Property         Value
dc:creator       apps
oujnl:jtitle     materialia
oujnl:issn       1359-6462
oujnl:volume     48
oujnl:issue      5
oujnl:spage      475
dcterms:issued   2003
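As an illustration of how the terms in Table 1 might be serialised, the sketch below maps each prefixed property onto a namespaced XML element. The Dublin Core namespace URIs are the standard ones; the URI used for ‘oujnl’, the ‘zetoc’ namespace and the wrapper element name `journalSearchRequest` are assumptions made for the example and are not taken from the zetoc application profile.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "oujnl": "info:ofi/fmt:xml:xsd:journal",  # assumed URI for the OpenURL journal format
    "zetoc": "urn:example:zetoc",             # hypothetical placeholder namespace
}

TABLE_1 = {  # search terms copied from Table 1
    "dc:creator": "apps",
    "oujnl:jtitle": "materialia",
    "oujnl:issn": "1359-6462",
    "oujnl:volume": "48",
    "oujnl:issue": "5",
    "oujnl:spage": "475",
    "dcterms:issued": "2003",
}


def build_request(terms: dict) -> ET.Element:
    """Turn prefixed property names into namespaced child elements."""
    root = ET.Element(f"{{{NAMESPACES['zetoc']}}}journalSearchRequest")
    for qname, value in terms.items():
        prefix, local = qname.split(":", 1)
        ET.SubElement(root, f"{{{NAMESPACES[prefix]}}}{local}").text = value
    return root


# Register prefixes so the serialised XML uses dc:, oujnl:, etc.
for prefix, uri in NAMESPACES.items():
    ET.register_namespace(prefix, uri)

request = build_request(TABLE_1)
print(ET.tostring(request, encoding="unicode"))
```

Such an element would form the payload of the SOAP body in a journal search request.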



Search Response. A search response for the above example would return a 
list of brief records containing the single record shown in Table 2. 



Table 2. Example brief record response.

Property                        Value
srw:recordPosition              1
dc:title                        Phase compositions in magnesium-rare earth alloys
dc:creator                      Apps, P. J.; et-al
dcterms:bibliographicCitation   SCRIPTA MATERIALIA - 2003; VOL 48; NUMBER 5; Pages: 475-481
dc:identifier                   RN125218404



‘Identifier’ Response. The ‘identifier’ full record response (omitting conference
paper properties that are irrelevant to this article) would be as in Table 3.

Table 3. Example full record response.

Property              Value
srw:numberOfRecords   1
dc:identifier         RN125218404
zetoc:type            J (ie. journal)
dc:title              Phase compositions in magnesium-rare earth alloys...
dc:creator            Apps, P. J.; Karimzadeh, H; King, J. F.; Lorimer, G. W.
zetoc:dewey           669
zetoc:lccn            TT273
oujnl:jtitle          SCRIPTA MATERIALIA
oujnl:issn            1359-6462
zetoc:volissue        VOL 48; NUMBER 5
oujnl:pages           475-481
dcterms:issued        2003
dc:publisher          Great Britain : Elsevier Science B.V., Amsterdam.
zetoc:frequency       Fortnightly: 15-30 issues per year
dc:language           English
zetoc:shelfmark       8212.970000

3.8 An Alternative ‘Identifier’ Response

An alternative approach to implementing a full record ‘identifier’ response would
be to return a simple Dublin Core record for the discovered article, including
salient information such as its title and authors. This simple Dublin Core record
would contain a ‘by-reference’ link, a pointer as the value of a ‘dc:relation’ property,
to a full zetoc XML record. This pointer could be an OpenURL that
would return an XML record for the item in zetoc as in the following example.
Note that in this example a hypothetical resolver address is used. An
actual OpenURL would be ‘URL escape encoded’, with special characters in
hexadecimal format for safe transmission, but this encoding has been omitted,
and line-breaks have been added to the OpenURL, for readability.

http://zetoc.mimas.ac.uk/openurl/linkto?
url_ver=Z39.88-2004
&url_ctx_fmt=info:ofi/fmt:kev:mtx:ctx
&rft_val_fmt=info:ofi/fmt:kev:mtx:dc
&rft.identifier=RN125218404
&svc_val_fmt=info:ofi/fmt:kev:mtx:dc
&svc.format=text/xml

This OpenURL uses the Dublin Core metadata format to describe the zetoc 
record required, the referent, as a zetoc identifier. That identifier being local
to zetoc makes an OpenURL ‘referent-identifier’ key inappropriate. The Dublin 
Core metadata format is also used to request a service type that returns an XML 
record. 
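The ‘URL escape encoding’ omitted from the example for readability can be produced with a standard library routine. The sketch below rebuilds the same OpenURL from its key/value pairs; the resolver address is the hypothetical one used in the example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

RESOLVER = "http://zetoc.mimas.ac.uk/openurl/linkto"  # hypothetical resolver address

PARAMS = [  # key/value pairs from the example, in the same order
    ("url_ver", "Z39.88-2004"),
    ("url_ctx_fmt", "info:ofi/fmt:kev:mtx:ctx"),
    ("rft_val_fmt", "info:ofi/fmt:kev:mtx:dc"),
    ("rft.identifier", "RN125218404"),
    ("svc_val_fmt", "info:ofi/fmt:kev:mtx:dc"),
    ("svc.format", "text/xml"),
]

# urlencode percent-encodes reserved characters such as ':' and '/' in the values.
openurl = RESOLVER + "?" + urlencode(PARAMS)
print(openurl)

# A receiving resolver can decode the query string back into the original values.
decoded = parse_qs(urlsplit(openurl).query)
print(decoded["rft.identifier"][0])
```

The encoded form is a single line; the colons and slashes of the format identifiers appear as `%3A` and `%2F`.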

An alternative OpenURL could use private data to pass the zetoc identifier, 
in which case the fourth and fifth lines of the above example would be replaced 
by: 

&rft_dat=RN125218404 

If the zetoc identifier were to become a URI, possibly by registering it within 
the new ‘info’ URI scheme [24], then the fourth and fifth lines of the above 
example could be replaced by the preferable: 

&rft_id=info:zetoc/RN125218404

The returned record could possibly be an XML Dublin Core description
with related metadata using the OpenURL journal or book metadata format as 
suggested in [25]. 

The advantage of this approach is that the ‘identifier’ response would return 
an interoperable simple Dublin Core record. The disadvantage is that any service 
retrieving this record would have to make a further retrieval to obtain the full 
zetoc record, including its bibliographic citation information that could not be
captured in the simple Dublin Core record. This approach would be suitable for 
returning a simple Dublin Core record from a zetoc SRW implementation. 

4 Authentication 

zetoc is available to members of institutions in UK Higher and Further Ed- 
ucation and the UK National Health Service. It is also provided by modest 
subscription to various other bodies in UK academia, including the research 
councils, and to institutions in Ireland. Authentication is firstly by a machine 
domain name (DNS) or IP address check, and failing that by Athens [26], the access
authorisation system used within UK Higher and Further Education. zetoc
SOAP allows access using the same machine address checks. Access via Athens
is not supported, human intervention not being possible. The same terms and
conditions for the use of zetoc apply. This means that any portal must first 
check that a user has a right to use zetoc before providing a search through the 
zetoc SOAP interface. 
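A machine address check of this kind can be sketched as a lookup of the calling address in a list of permitted networks. The networks below are reserved documentation ranges used as placeholders; the actual zetoc access list and its implementation are not described here, so this is an assumption-laden illustration only.

```python
import ipaddress

# Hypothetical allow-list; these documentation ranges are placeholders.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_authorised(client_ip: str) -> bool:
    """True if the calling machine's address falls in a permitted network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


print(is_authorised("192.0.2.17"), is_authorised("203.0.113.5"))
```

Because the check identifies only the calling machine, a portal passing requests through it remains responsible for confirming that its own users are entitled to the service.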

The A2Z project has investigated and successfully demonstrated the use 
of Akenti digital certificate authenticated access to the zetoc Web interface, 
as reported elsewhere [9]. The original intention of the ‘Web Services’ part of
the A2Z project was to investigate the use of digital certificate authenticated 
access to a zetoc SOAP interface within an eScience application such as myGrid. 
However it became apparent that this was not viable within the time frame of the 
project because digital certificates are not yet in use by the potential user base. 
Thus providing digital certificate authentication has not been taken forward. 
It would simply involve replacing the current machine address authentication 
module with the A2Z digital certificate ‘black box’ module and installing the 
access point to zetoc SOAP on the A2Z secure server. 

5 Implementation 

zetoc SOAP is implemented in C++ using gSOAP. gSOAP is a set of compiler
tools that provide a SOAP/XML-to-C++ language binding to ease the development
of SOAP/XML Web services and client applications in C++. Developed by
Prof Robert van Engelen and his team in the Department of Computer Science
and School of Computational Science and Information Technology at Florida
State University, USA [27], it is available under a GNU licence from SourceForge [28].
gSOAP is used to implement several major applications, including
Adobe Version Cue, an innovative file-management feature of Adobe Creative
Suite.

gSOAP takes care of all the details of the XML to support the SOAP protocol
and also the serialisation of the XML elements of the zetoc requests and
responses to and from the C++ public data of the zetoc SOAP server implementation.
gSOAP also generates the requisite WSDL (Web Services Description
Language) file that provides a machine readable definition of the interface.
Requests are translated into searches in the format of the underlying Livelink
Discovery Server (previously known as BRS/Search) [29] database, the searches
being performed by existing C++ code modules. A zetoc SOAP client was
implemented with gSOAP alongside the server for testing.



6 Conclusion 

The use of Web Services is becoming increasingly important for machine-to-machine
communication. It is already used within eScience Grid applications
and projects. It is mandated for machine-to-machine applications within the
UK government’s interoperability framework (eGif) [30]. Within the JISC Information
Environment, portals are starting to use Web Services, the Resource
Discovery Network (RDN) [31] has a SOAP/SRW interface, and collections with
Web Services access are recorded in the Information Environment Service Registry [32].

As discussed above, the zetoc SOAP interface is bespoke rather than using 
SRW. The short timescale available for the development of the zetoc SOAP 
interface within the funding of the A2Z project did not allow for any investigation 
into the provision of an SRW interface. A future SRW interface for zetoc, given 
the availability of funding, will be developed to allow its inclusion in Web Services 
distributed search requests within the JISC Information Environment, although 
distributed searching is already enabled via Z39.50. But it seems appropriate to 
provide an interface to an application that is specific to its data and purpose. The 
Common Query Language of SRW will provide interoperability but it seems to 
be too general when specific requests and results of an application are required. 
Also SRW will allow search requests inappropriate to an application resulting in 
null or distorted responses. For example a search for ‘dc:description’ in zetoc 
would return results from around 1994 only, later records not having abstracts. 

Similarly the requirement of SRW to return simple Dublin Core records pro- 
vides an interoperable result set for a distributed search but it does not cater for 
the return of richer application-specific details. zetoc SOAP provides a bespoke
result format to include the important bibliographic citation details necessary 
to make use of any record from zetoc. There is no recommended way to in- 
clude bibliographic citation information about a resource within a simple Dublin 
Core record. The alternative approach given above in section 3.8 would resolve 
this problem if zetoc were to provide an interoperable SRW implementation
to support distributed searching. But it seems that a retrieval by a server that 
understands the zetoc application would be simpler using the bespoke interface. 

Resembling the zetoc Web search interface, the zetoc SOAP interface does
not allow the inclusion of Boolean operators in search requests, all search terms
being implicitly ‘anded’. Future developments to the zetoc SOAP interface
would investigate the provision of Boolean operators between search terms when
assembling searches on the underlying database. This functionality would be enabled
if the Common Query Language of SRW were supported.

Developing zetoc SOAP has been a useful experience in exploring the design 
and specification of such an interface and the issues involved. Investigation into 
alternative implementations, such as SRW, was limited by the short timescale 
of this part of the A2Z project. But this development will provide a prototype 
for Web Services implementations for further information collections.

Abstract. Because of recent advances in graphics hard- and software, both the
production and use of 3D models are increasing at a rapid pace. As a result, a 
large number of 3D models have become available on the web, and new research 
is being done on 3D model retrieval methods. Query and retrieval can be done 
solely based on associated text, as in image retrieval, for example (e.g. Google 
Image Search [1] and [2,3]). Other research focuses on shape-based retrieval, 
based on methods that measure shape similarity between 3D models (e.g., [4]). 
The goal of our work is to take current text- and shape-based matching methods, 
see which ones perform best, and compare those. We compared four text matching
methods and four shape matching methods, by running classification tests using
a large database of 3D models downloaded from the web [5]. In addition, we
investigated several methods to combine the results of text and shape matching. 
We found that shape matching outperforms text matching in all our experiments. 
The main reason is that publishers of online 3D models simply do not provide 
enough descriptive text of sufficient quality: 3D models generally appear in lists 
on web pages, annotated only with cryptic filenames or thumbnail images. Com- 
bining the results of text and shape matching further improved performance. The 
results of this paper provide added incentive to continue research in shape-based 
retrieval methods for 3D models, as well as retrieval based on other attributes. 



1 Introduction 

There has been a recent surge of interest in methods for retrieval of 3D models from 
large databases. Several 3D model search engines have become available within the last 
few years (e.g., [6-9]), and they cumulatively index tens of thousands of 3D polygonal 
surface models. Yet, still there have been few research studies investigating which types 
of query and matching methods are most effective for 3D data. Some 3D model search 
engines support only text queries [6], while others provide “content-based” queries 
based on shape [4]. But how do shape-based and text-based retrieval methods compare? 

To investigate this question, we measured classification performance of the cur- 
rently best-performing text-based and shape-based matching methods. We also evalu- 
ated several functions that combine text and shape matching scores. For the text match- 
ing, a 3D model is represented by a text document, created from several sources of 
text associated with the model, as well as synonyms and hypernyms (category descrip- 
tors) of the 3D model filename (added using WordNet, a lexical database [10]). For the 
shape matching, a 3D model is represented by a shape descriptor, computed from the 
polygons describing the model’s surface. 

R. Heery and L. Lyon (Eds.): ECDL 2004, LNCS 3232, pp. 209-220, 2004. 

© Springer-Verlag Berlin Heidelberg 2004

All classification tests were done using the “Princeton Shape Benchmark” (PSB)
3D model test database [5]. It contains 1814 3D models downloaded from the web, 
subdivided into a training set and a test set, containing 907 models each, manually clas- 
sified into 90 and 92 comparable classes respectively. It is a subset of a larger database 
of about 33000 models downloaded from the web using an automatic crawler [11]. 
Retrieval results were evaluated using precision/recall curves [12]. 

We found that shape-based matching outperforms text-based matching in all our 
experiments. The main reason is that 3D models found on the Web are insufficiently 
annotated. They usually are presented in lists, annotated with at most a single name, 
which is often misspelled or a repeat of the filename. Of the available text sources, we 
found that the text inside the model file itself and the synonyms and hypernyms of the 
filename were the most discriminating. Additionally, we found that for combining the 
results of the shape and the text matching method a function returning the minimum of 
normalized scores showed the best performance. 

The contribution of this paper is a comparison of text and shape matching methods 
for retrieval of online 3D models, and an evaluation of several different combination 
functions for combining text and shape matching scores. This paper shows that the 
relatively simple solution of using only associated text for retrieval of 3D models is not 
as effective as using their shape. 

The rest of this paper is organized as follows. Text matching and our approach for 
maximizing text retrieval performance is described in the next section. Section 3 dis- 
cusses shape matching and shows the performance of several recent shape matching 
methods. Text and shape matching are compared in Section 4 and combined in Sec- 
tion 5. Conclusions and suggestions for future work are in Section 6. 



2 Text Matching 

In this section we review related work on retrieval of non-textual data using associated 
text. Note that we do not discuss text retrieval itself. We refer the interested reader to 
[13-15]. We then describe the sources of text found with 3D models crawled from the 
web and investigate how well the text can be used to compute a similarity measure for 
the associated 3D models. 

2.1 Related Work 

There has been relatively little previous research on the problem of retrieving non- 
textual data using associated text. The web is an example of a large database for which 
such methods can be useful: (1) it contains many non-textual objects (e.g., images, 
sound files, applets) and (2) these objects are likely to be described on web pages using 
text. Examples of web search engines that take advantage of associated text are Google 
Image Search [1] (for images), FindSounds [16] (for sound files), and MeshNose [6] 
(for 3D models). 

Probably the largest site for searching images using text keywords is Google’s image 
search. Unfortunately there are no publications available about the method they use. A 
related FAQ page suggests that heuristics are used to determine potentially relevant text 
related to an image, for example, the image filename, link text, and web page title. Each
source is probably assigned a different weight, depending on its importance, similar to 
how the main Google search site assigns weights to text terms depending on whether 
they are in the title, headers, and so on. 

Sable and Hatzivassiloglou investigated the effectiveness of using associated text 
for classifying images from online news articles as indoor or outdoor [17]. They found
that limiting the associated text to just the first sentence of the image caption produced
the best results. In other work, Sable et al. use Natural Language Processing (e.g., identifying
subjects and verbs) to improve classification performance of captioned images
into four classes [18]. Our problem is slightly harder since our source text is less well- 
defined (i.e., there is not an obvious “caption”), and the number of classes is much 
higher. To our knowledge, there has never been a study investigating the effectiveness 
of text indexing for 3D models. 

2.2 Text Sources 

In our study, we focus on the common “bag of words” approach for text matching: all 
text that is deemed relevant to a particular 3D model is collected in a “representative 
document,” which is then processed and indexed for later matching. This document 
is created using several potentially relevant text sources. Because we are indexing 3D 
model files linked from a web page, we are able to extract text from both the model file 
itself as well as the web page (note that because we convert all models to the VRML 2.0 
format, we only refer to text sources of this format). The following list describes the text 
sources we can use: 

From the model file: 

1. model filename: The filename usually is the name of the object type. The extension
determines the filetype. For example, alsation.wrl could be the filename of a
VRML file of an Alsatian dog

2. model filename without digits: From the filename we create a second text source 
by replacing all digits with spaces. Very often filenames contain sequence numbers 
(for example, chair2.wrl) that are useless for text keyword matching

3. model file contents: A 3D model file often contains labels, metadata, filenames of 
included files, and comments. In VRML, it is possible to assign a label to a scene- 
graph node (a part of the model) and then re-use that node elsewhere in the file. 
For example, in a model of a chair, a leg can be defined once, assigned the identi- 
fier LEG, and then re-used three times to create the remaining legs. As such, these 
identifiers typically describe names of parts of the model. To describe metadata, a 
VRML 2.0 file may contain a WorldInfo node, which is used to store additional
information about the model, such as a detailed description, the author name, etc.
Filenames of included files can be names of other model files, textures, or user-defined
nodes. Finally, a model file may contain descriptive comments. The model
file comments were left out from our experiments because we found that many files
contain commented-out geometry, which, when included, would add many irrele- 
vant keywords 
From the web page:

4. link text: This is the descriptive text of the hyperlink to the model file, i.e., the text
between the <a> and </a> HTML tags. For example:
<a href="b747.wrl">a VRML model of a Boeing 747</a>

5. URL path: These are the directory names of the full URL to the model file. If 
multiple models are organized in a directory structure, the directory names could 
be category names helpful for classification. For example, as in the URL 

http://3d.com/objects/chairs/chair4.wrl

6. web page context (text near the link): We define the context to be all plain text 
after the </a> tag until the next <a href> tag (or until the next HTML tag if
there is none). This text could for example read “1992 Boeing 747-400 passenger
plane, 210K, created by John Doe”. Context found before the link text was found
to be mostly irrelevant 

7. web page title: The title of the web page containing the link to the 3D model. It 
often describes the category of models found on the page, for example, “Models of 
Airplanes” 

Additional text source: 

8. WordNet synonyms and hypernyms: We create an additional eighth text source
by adding synonyms and hypernyms (category descriptors) of the filename using 
WordNet, a lexical database [10] (if no synonyms or hypernyms can be found for 
the filename, the link text is tried instead). In related work, Rodriguez et al. use 
WordNet synonyms [19], and Scott and Matwin use synonyms and hypernyms [20] 
to improve classification performance. Recently, Benitez and Chang showed how 
WordNet can be used to disambiguate text in captions for content-based image 
retrieval [21]. Adding synonyms and hypernyms enables queries like “vehicle” to 
return objects like trucks and cars, or “television” to return a TV. WordNet returns 
synonyms and hypernyms in usage frequency order, so we can limit the synonyms 
and hypernyms used to only the most common ones. 
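The web-page sources in the list above (link text, URL path, and filename without digits) can be gathered with standard HTML and URL parsing. The following Python sketch is illustrative only: the class and function names are hypothetical and do not come from the authors' crawler, and it handles only simple, well-formed links.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkTextParser(HTMLParser):
    """Collects (href, link text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished (href, text) pairs
        self._href = None     # href of the <a> currently open, if any
        self._text = []       # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def filename_sources(url):
    """Return (filename, filename without digits, URL path directories)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    filename = parts[-1] if parts else ""
    stem = filename.rsplit(".", 1)[0]
    # Digits become spaces, as in text source 2.
    no_digits = re.sub(r"\d", " ", stem).strip()
    return filename, no_digits, parts[:-1]


parser = LinkTextParser()
parser.feed('<a href="b747.wrl"> a VRML model of a Boeing 747</a>')
print(parser.links)
print(filename_sources("http://3d.com/objects/chairs/chair4.wrl"))
```

The page context and page title could be collected in the same parser by tracking text after each closing </a> tag and inside <title>; that is omitted here to keep the sketch short.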

Following common practices from text retrieval, all collected text goes through a 
few processing steps. First, stop words are removed. These are common words that do 
not carry much discriminating information, such as “and,” “or,” and “my”. We use the 
SMART system’s stop list of 524 stop words [22], as well as stop words specific to our 
domain (e.g. “jpg,” “www,” “transform”). Next, the resulting text is stemmed (normal- 
ized by removing inflectional changes, for example “wheels” is changed to “wheel”), 
using the Porter stemming algorithm [23]. 
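The two processing steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the stop list is a tiny hypothetical sample (not the 524-word SMART list), and `naive_stem` is a crude stand-in that only strips a plural "s", not the Porter algorithm.

```python
import re

# Tiny illustrative stop list; the paper uses the SMART list plus
# domain-specific words such as "jpg", "www" and "transform".
STOP_WORDS = {"and", "or", "my", "a", "of", "the", "jpg", "www", "transform"}


def naive_stem(word):
    """Crude stand-in for the Porter stemmer: strips a plural 's' only."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("My model of the wheels and chairs.jpg"))
# ['model', 'wheel', 'chair']
```

A real system would substitute a full Porter implementation for `naive_stem`; the order of the steps (stop-word removal first, stemming second) follows the text.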



2.3 Text Matching Methods 

Given a representative text document for each 3D model, we can define a textual sim- 
ilarity score for every pair of 3D models as the similarity of their representative text 
documents. To compute this score, we use a variety of text matching methods provided 
by rainbow, a program of the Bow toolkit, a freely available C library for statistical 
text analysis [24]. The tested methods were: three variations of TF/IDF [13], Kullback-Leibler
[25], K-nearest neighbors, and Naive Bayes [26]. A few methods supported by
the rainbow program (e.g., Support Vector Machines) were also tested but failed to
run to completion.

Fig. 1. Average precision/recall for four types of text matching methods, and random retrieval.

Figure 1 shows the resulting precision/recall results obtained when the representa- 
tive text document for each 3D model in the test set of the Princeton Shape Benchmark 
was matched against the representative text documents of all the other 3D models. The 
matches were ranked according to their text similarity scores, and precision-recall val- 
ues were computed with respect to the base classification provided with the benchmark. 
From this graph, we see that the “TF/IDF log occur” method shows the best perfor- 
mance for our dataset, and thus we used this method for all subsequent tests. 
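To make the idea of TF/IDF matching between representative documents concrete, here is a minimal Python sketch. The weighting used, (1 + log tf) · log(N / df) with cosine similarity, is an assumption chosen for illustration; the exact formula behind rainbow's “TF/IDF log occur” variant is not given in this paper and may differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a {term: weight} dict.

    Weight = (1 + log tf) * log(N / df); this particular weighting is an
    assumption, not necessarily what rainbow computes.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (1 + math.log(c)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    ["chair", "leg", "seat"],
    ["chair", "leg", "leg", "armrest"],
    ["boeing", "plane", "wing"],
]
vecs = tfidf_vectors(docs)
# The two chair documents share terms; the plane document shares none.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))
```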

To determine the most useful combinations of text sources, we ran a classification 
test using each combination of n out of the eight text sources for the representative 
text document, with $n \in \{1, \ldots, 8\}$ (so the total number of combinations tested was
$\sum_{n=1}^{8} \binom{8}{n} = 255$). The performance of each combination was measured as the average
precision over twenty recall values. 
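The count of 255 is simply the number of non-empty subsets of eight items, $2^8 - 1$. A short Python sketch that enumerates them (the source labels are taken from Table 1; the variable names are illustrative):

```python
from itertools import combinations

SOURCES = [
    "filename", "filename without digits", "model file", "link",
    "path", "page context", "page title", "synonyms and hypernyms",
]

# Every non-empty subset of the eight sources, grouped by subset size n.
subsets = [c for n in range(1, len(SOURCES) + 1)
           for c in combinations(SOURCES, n)]
print(len(subsets))  # 255, i.e. 2**8 - 1
```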

From these tests, we found that adding as many text sources as possible improves 
overall performance, in general. This may be explained by our observation that the 
addition of keywords helps classification performance if the keywords are relevant, but 
does not hurt performance if they are irrelevant, since they do not match many other 
models. We expect that as the database size increases, this property will no longer hold 
because irrelevant keywords would generate cross-class matches. 

Looking more closely at how often each source occurs in the best combinations, we 
counted the number of times each source appears in the top 50 combinations (i.e., the
50 combinations out of 255 with the highest average precision). The results are shown
as percentages in Table 1. We see that the identifiers found inside the 3D model files
themselves provided the most information for classification. The WordNet synonyms
and hypernyms also turned out to be very useful, despite the fact that for 279 models
(31%) no synonym or hypernym was found. Model names for which WordNet did
not return a synonym or hypernym included names (e.g., “justin”), abbreviated words
(“satellt”), misspelled words (“porche”), and words in a different language (“oiseau”).

Table 1. Percentage of all occurrences of each text source appearing in the best 50 combinations.

source                     percentage in top 50
model file                 100
synonyms and hypernyms     100
link                       62
filename without digits    58
filename                   58
path                       56
page title                 54
page context               50



3 Shape Matching 

In this section, we briefly review previous work on shape-based retrieval of 3D mod- 
els. Then, we present results comparing several standard shape matching methods to 
determine which works best on the Princeton Shape Benchmark. 



3.1 Related Work 

Retrieval of data based on shape has been studied in several fields, including computer 
vision, computational geometry, mechanical CAD, and molecular biology. For surveys 
of recent methods, see [27,28]. For our purpose, we will only consider matching and
retrieval of isolated 3D objects (so we do not consider recognition of objects in scenes, 
or partial matching, for example). 

3D shape retrieval methods can be roughly subdivided into three categories: (1) 
methods that first attempt to derive a high-level description (e.g., a skeleton) and then 
match those, (2) methods that compute a feature vector based on local or global statis- 
tics, and (3) miscellaneous methods. 

Examples of the first type are skeletons created by voxel thinning [29], and Reeb 
graphs [30]. However, these methods typically require the input model to be 2-manifold, 
and usually are sensitive to noise and small features. Unfortunately, many 3D models 
are created for visualization purposes only, and often contain only unorganized sets 
of polygons (“polygon soups”), possibly with missing, wrongly-oriented, intersecting, 
disjoint, and/or overlapping polygons, thus making them unsuitable for most methods 
that derive high-level descriptors. 

Methods based on computing statistics of the 3D model are more suitable for our 
purpose, since they usually impose no strict requirements on the validity of the input 
model. Examples are shape histograms [31], feature vectors composed of global geo- 
metric properties such as circularity or eccentricity [32], and feature vectors (or shape 
descriptors) created using frequency decompositions of spherical functions [33]. The
resulting histograms or feature vectors are then usually compared by computing their
$L_2$ distance.

Fig. 2. Average precision/recall for four shape matching methods, and random retrieval. This
figure is a partial reproduction of one in [5].
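As a small illustration of comparing feature vectors by their L2 (Euclidean) distance, the following Python sketch ranks a toy database against a query descriptor. The two-dimensional vectors and the function names are made up for the example; real shape descriptors have many more dimensions.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_by_distance(query, database):
    """Return database indices ordered from most to least similar."""
    return sorted(range(len(database)),
                  key=lambda i: l2_distance(query, database[i]))


descriptors = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
print(rank_by_distance([0.0, 0.0], descriptors))  # [0, 2, 1]
```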

Some alternative approaches use 2D views (2D projections of a 3D model), justified 
by the heuristic that if two 3D shapes are similar, they should look similar from many 
different directions. Examples are the “prototypical views” of Cyr and Kimia [34], and 
the “Light Field Descriptor” of Chen et al. [35]. 

3.2 Shape Matching Methods 

In our experiments, we considered four shape matching methods: (1) the Light Field Descriptor
(LFD) [35], (2) the Radialized Spherical Extent Function (REXT) [36], (3) the
Gaussian Euclidean Distance Transform (GEDT) [33], and (4) the Spherical Harmonics
Descriptor (SHD) [33]. These four methods have been shown to be state-of-the-art
in a recent paper [5].

We ran an experiment in which these four methods were used to compute a similar- 
ity score for every pair of 3D models in the test set of the Princeton Shape Benchmark. 
The similarity scores were used to rank the matches for each model and compute an 
average precision-recall curve for each matching method with respect to the bench- 
mark’s base classification. Results are shown in Figure 2 (see [5] for details). From 
these curves, we find that the Light Field Descriptor provides the best retrieval perfor- 
mance in this test, and thus we use it in all subsequent experiments. 
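The ranking-based evaluation used here and in Section 2.3 can be sketched as follows. This Python example computes precision at evenly spaced recall levels for one query and averages them. The interpolation rule (best precision at or beyond each recall level) and the function names are assumptions made for illustration; the paper does not spell out how its curves were interpolated.

```python
def precision_at_recall_levels(ranked_labels, query_label, levels=20):
    """Interpolated precision at `levels` evenly spaced recall values."""
    total_relevant = sum(1 for lab in ranked_labels if lab == query_label)
    if total_relevant == 0:
        return [0.0] * levels
    # (recall, precision) after each relevant item in the ranking.
    points = []
    hits = 0
    for rank, lab in enumerate(ranked_labels, start=1):
        if lab == query_label:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    result = []
    for i in range(1, levels + 1):
        recall = i / levels
        # Best precision achievable at this recall level or beyond.
        candidates = [p for r, p in points if r >= recall - 1e-12]
        result.append(max(candidates) if candidates else 0.0)
    return result


def average_precision(ranked_labels, query_label, levels=20):
    """Mean of the interpolated precision values over all recall levels."""
    values = precision_at_recall_levels(ranked_labels, query_label, levels)
    return sum(values) / len(values)


# Class labels of the retrieved models, best match first.
ranking = ["chair", "table", "chair", "plane"]
print(average_precision(ranking, "chair"))
```

Averaging this value over all query models would give one number per matching method, comparable to the average precision figures reported in the paper.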

4 Comparing Text Matching to Shape Matching 

Next, we compare the classification performance of the best text matching method to the 
best shape matching method. Figure 3 shows the resulting average precision/recall plot. 
The shape matching method significantly outperforms text matching: average precision
is 44% higher.

Fig. 3. Average precision/recall for text and 3D shape matching.

The main cause of the relatively poor performance of the text matching method is 
the low quality of text annotation of online 3D models. Upon examination of the actual 
text associated with each model, we found that many filenames were meaningless (e.g., 
“avstest” for a face), misspelled (e.g., “ferrar” for a ferrari), too specific (e.g., “camaro”
for a car), or not specific enough (e.g., “model” for a car). Some models were annotated 
in a language other than English (e.g., “oiseau” for a bird). By running a spell checker 
on the filenames with the digits removed, we found that 36% of all model filenames 
were not English words. 

Also, for several sources there is simply no useful text available. For example, many 
link texts were either a repeat of the filename, or contained no text at all: for 446 models 
in the training set (51%) no link text could be found (usually a thumbnail image is used 
instead). Furthermore, 4 (0.4%) models had no path information (i.e., no directories 
in their URL), 193 (21%) web page titles were missing, for 279 (31%) filenames or 
link texts no synonym could be found, for 692 (76%) models no web page context was 
found, and for 153 (17%) models no text inside the model file was found.

Even commercial 3D model databases are not necessarily well annotated. Of three 
commercial databases available to us (provided by CacheForce, De Espona, and View- 
point, containing approximately 2000, 1000, and 1000 models respectively), only one 
was consistently well annotated. 

In all text matching experiments, the representative document created for each 3D 
model was used as a query. However, because the size and quality of text annotation 
varies a lot from model to model, one may argue that this text is not representative of 
actual user queries. Users of a retrieval system are more likely to enter a few descriptive 
keywords or class names. To investigate classification performance given this kind of
user query, we ran an additional classification experiment in which the full category
names of the Princeton Shape Benchmark were used as simulated user queries. Some
obvious keywords were added manually (e.g., “blimp” and “zeppelin” for the “dirigible
hot air balloon” category, or “house” for the “home” category). The average precision 
achieved when using these query keywords was 11% higher than when using the repre- 
sentative documents, but still 30% lower than the best shape matching method. 

5 Combining Text and Shape Matching 

In our final tests, we investigate how to combine the best text and shape matching meth- 
ods to provide better retrieval results than can be achieved with either method alone. 

5.1 Related Work 

A considerable amount of research has been presented on the problem of how to best 
combine multiple classifiers [37]. Most work in this area has been done in content-based
image retrieval. For example, Srihari presents a system (“Piction”) that identifies faces 
in annotated news photographs using a face detection algorithm and Natural Language 
Processing of the captions [38]. Smith and Chang describe a system for the retrieval of 
arbitrary online images [3]. Relevant text for each image is extracted from the URL and 
the alt parameter of the HTML <img> tag, for example. However, searches based on 
low-level image features or on text cannot be combined. This combination has been
investigated in later retrieval systems. La Cascia et al. combine text and image fea- 
tures into a single feature vector, to improve search performance [39]. Text is extracted 
from the referring web page, with different weights assigned depending on the HTML 
tag that enclosed it (e.g., text in a <title> is more important than text in an <h4> 
(small header) tag). Paek et al. present a method that combines a text- and image-based 
classifier for the classification of captioned images into two classes (“indoor” and “out- 
door”) [40], which improved classification accuracy to 86.2%, from 83.3% when using 
the text alone. 

5.2 Multiclassifiers 

In previous work we suggested that the results of text and shape matching can be com- 
bined to improve classification performance [4], and proposed a combination function 
that simply averaged mean-normalized matching scores. However, no evaluation was 
done to see which combination function works best. Here we consider the simple case 
of combining the scores of two classifiers, using a static combination function. We ex- 
perimented with four types of functions: (1) linear weighted average, (2) minimum, 
(3) (minimum) rank, and (4) using confidence limits. The first two were also tested on 
mean-normalized scores. 

1. linear weighted average: If $s_1$ and $s_2$ are the matching scores of a pair of models,
then the combined score is $w \cdot s_1 + (1 - w) \cdot s_2$, with $w$ the weight setting. We
computed average precision/recall for $w \in \{0, 0.05, 0.1, \ldots, 1.0\}$, and picked the
value of $w$ which resulted in the highest overall precision. The optimal weight
setting for the training set was $(0.1 \cdot \text{text} + 0.9 \cdot \text{shape})$

2. minimum: The lowest matching score (signifying highest similarity) is returned

3. rank: The matching scores are ordered, and the resulting rank of each query becomes
its new matching score. We can then apply one of the first two functions
(linear weighted average or minimum)

4. confidence limits: The “confidence limits” method is based on the idea that if a
similarity score of a single classifier is sufficiently close to zero, then that classifier
can be trusted completely. The output of other classifiers is then ignored. Sable uses
a variant of this method when combining a text- and image-based classifier [2]: feature
vectors from both are classified using a Support Vector Machine, and a confidence
level is assigned to the classification, depending on the distance of the vector
from the dividing hyperplane (the decision boundary). If the confidence level of the
image-based classifier is high enough, then the text-based classifier is ignored. If
not, then the text-based classifier is used and the image-based classifier is ignored.
We used the training set to determine optimal limit settings of 0.09 and 0.22 for
shape and text matching respectively (and -2.45 and -1.5 for mean-normalized
scores). If both scores were above their limit, we reverted to the linear weighted
average (other alternatives yielded worse results)

Table 2. Average precision achieved when combining the matching scores of the text and shape
matching methods using various static combiners, and the percentage improvement over shape
matching alone.

method                                 average precision   % improvement over shape alone
shape (LFD)                            0.507               -
minimum, normalized scores             0.536               5.8
confidence limits, normalized scores   0.533               5.1
weighted average, normalized scores    0.523               3.2
confidence limits                      0.520               2.6
weighted average                       0.519               2.4
minimum                                0.508               0.2
minimum rank                           0.495               -2.3
weighted average rank                  0.487               -4
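The four static combiners described in this section can be sketched in Python as below. This is a minimal illustration, not the authors' implementation. Scores are treated as dissimilarities (lower means more similar), as the text implies. Two details are assumptions: “mean-normalized” is taken to mean subtracting the mean and dividing by the standard deviation, and the confidence-limits function checks the shape score before the text score, since the paper does not say which classifier takes precedence.

```python
from statistics import mean, pstdev


def mean_normalize(scores):
    """Shift to zero mean and scale to unit standard deviation (assumed)."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def weighted_average(text, shape, w=0.1):
    """w * text + (1 - w) * shape, per candidate model."""
    return [w * t + (1 - w) * s for t, s in zip(text, shape)]


def minimum(text, shape):
    """Keep the lower (more similar) of the two scores."""
    return [min(t, s) for t, s in zip(text, shape)]


def to_ranks(scores):
    """Replace each score by its rank (0 = best match)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks


def confidence_limits(text, shape, text_limit=0.22, shape_limit=0.09, w=0.1):
    """Trust one classifier outright when its score is below its limit.

    Checking shape before text is an assumption made for this sketch.
    """
    combined = []
    for t, s in zip(text, shape):
        if s < shape_limit:
            combined.append(s)
        elif t < text_limit:
            combined.append(t)
        else:
            # Neither score is trusted: fall back to the weighted average.
            combined.append(w * t + (1 - w) * s)
    return combined


text_scores = [0.9, 0.2, 0.5]
shape_scores = [0.3, 0.8, 0.05]
print(minimum(mean_normalize(text_scores), mean_normalize(shape_scores)))
print(to_ranks(shape_scores))
print(confidence_limits(text_scores, shape_scores))
```

The default limits (0.09 and 0.22) and weight (0.1 for text) are the training-set values reported above for unnormalized scores; the toy score lists are invented for the example.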

Table 2 shows the resulting average precision values achieved for each combiner, 
and the percentage improvement over using shape matching alone (computed using the 
test set). We achieve an additional 5.8% improvement in average precision, using the
function that returns the minimum of normalized scores. These results confirm that the 
text and shape representations of a 3D model are sufficiently independent, such that 
when they are combined, they become more discriminating. There may well be other 
representations (e.g. appearance-based) that capture a very different aspect of a 3D 
model, and as such could increase performance even further. 

6 Conclusions and Future Work 

This paper evaluates text and shape matching methods for retrieval of online 3D models, 
as well as their combination. Classification tests were done using the Princeton Shape 
Benchmark, a large benchmark database of 3D models downloaded from the web. 
For text matching, we found that a variant of TF/IDF showed the best classification
performance. Text found inside a 3D model file itself and synonyms and hypernyms 
(category descriptors) of the model filename were most useful for classification. 

The currently best shape matching method (based on Light Field Descriptors [35]) 
significantly outperformed the best text matching method, yielding 44% higher average 
precision. The main reason is that the quality of text annotation of online 3D models 
is relatively poor, limiting the maximum achievable classification performance with a 
text-based method. 

We investigated several simple multiclassifiers, and found that a function return- 
ing the minimum of normalized matching scores produced an additional performance 
improvement of about 6%. Studying other more sophisticated multiclassifiers is an in- 
teresting area for future work, considering there are many other attributes of 3D models 
(e.g., color, texture, structure) for which additional classifiers can be designed. 

The main contribution of this paper is that it demonstrates the advantage of using 
shape-based matching methods over text-based methods for retrieval of 3D models. 
This should encourage designers of future 3D model retrieval systems to incorporate 
query methods based on shape, and other attributes that do not depend on annotation 
provided by humans, as they hold much potential for improving retrieval results. 
Abstract. We propose a probabilistic method for query expansion in
online public access catalogs that utilizes both historical query logs and 
the subject headings in the library catalog. Our method creates corre- 
lations between query and document terms, allowing relevant subject 
headings from the corpus to be retrieved and added to a query. Ex- 
periments demonstrate an average of 31.1% performance increase over 
currently fielded baselines. 



1 Introduction 

The problem of vocabulary mismatch is a deep-rooted problem in information 
retrieval as users often use different or too few words to describe the concepts 
in their queries as compared to the words that authors use to describe the con- 
cepts in their documents [1-3]. Despite this, many library online public access 
catalogs (OPACs), such as the INNOPAC system 1 used by Library Integrated 
Catalogue (LINC) at the National University of Singapore, still depend on key- 
word matching to determine the relevant documents for queries. 

Short queries damage retrieval effectiveness in two ways: 1) they lead to too
many results and 2) the queries themselves are ambiguous. The first phenome- 
non, often called information overload, makes searching difficult as users are over- 
whelmed with information. In our case study of LINC, from March to September 
2003, queries sent to LINC have a mean length of 2.815 words. For example, for 
the top query “java”, LINC returned 32,000 books, of which 953 books had 100% 
relevance, leaving the user to select between 953 alternatives. Short queries are 
often polysemous (having multiple senses or meanings, as in “java”: the com- 
puter language or the island in Indonesia). Such queries result in ambiguity as 
words that could disambiguate them are missing. 

Query expansion, the process of expanding a user’s query with additional 
related words and phrases, has been suggested to address the problem. However, 
finding and using appropriate related words remains an open problem. Research 
on query expansion has focused on intranet or internet web search. However, the 
typical digital library OPAC contains bibliographic records which are far more
structured than documents on the internet. On the other hand, traditional OPAC
research has largely focused on rule-based systems that do not take advantage
of corpora.

1 http://www.libdex.com/vendor/Innovative_Interfaces,_Inc.html

R. Heery and L. Lyon (Eds.): ECDL 2004, LNCS 3232, pp. 221-231, 2004.

© Springer-Verlag Berlin Heidelberg 2004

Our study melds the two approaches by analyzing library corpora for use 
in query expansion in the digital library OPAC. Our system combines both 
historical query logs and the library catalog to create a thesaurus-based query 
expansion that correlates query terms with document terms. Our process con- 
sists of three steps. First, historical query logs are analyzed to uncover frequent 
queries. These queries are sent to the OPAC to extract relevant subject headings 
from the top documents. In the last step, the system calculates probabilistic correlations
between the retrieved subject headings and users’ queries. With these
correlations, relevant subject headings can be selected from the corpus for new, 
unseen queries. 
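The three steps can be pictured with a small Python sketch. It is only a rough illustration under stated assumptions: the correlation is estimated here as a simple conditional probability P(subject | term) from co-occurrence counts, which may differ from the probabilistic model the paper develops later, and the `logs` mapping (logged query to subject headings of its top documents) is hypothetical toy data standing in for real OPAC responses.

```python
from collections import Counter, defaultdict


def build_thesaurus(query_to_subjects):
    """Estimate P(subject | term) from co-occurrence counts.

    `query_to_subjects` maps a logged query string to the subject headings
    of the top documents the catalog returned for it (hypothetical input).
    """
    counts = defaultdict(Counter)
    for query, subjects in query_to_subjects.items():
        for term in query.lower().split():
            counts[term].update(subjects)
    thesaurus = {}
    for term, subject_counts in counts.items():
        total = sum(subject_counts.values())
        thesaurus[term] = {s: c / total for s, c in subject_counts.items()}
    return thesaurus


def expand(query, thesaurus, k=2):
    """Return the k subject headings with the highest summed probability."""
    scores = Counter()
    for term in query.lower().split():
        for subject, p in thesaurus.get(term, {}).items():
            scores[subject] += p
    return [s for s, _ in scores.most_common(k)]


logs = {
    "java programming": ["Java (Computer program language)",
                         "Object-oriented programming"],
    "java island": ["Java (Indonesia)"],
    "programming": ["Computer programming",
                    "Object-oriented programming"],
}
thesaurus = build_thesaurus(logs)
print(expand("java programming", thesaurus, k=1))
```

Because the thesaurus is keyed by individual query terms, it can score subject headings for a query that never appeared in the logs, as long as some of its terms did.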

In Section 2, we discuss related work in query expansion. We then detail the methods
used to build the thesaurus (a matrix correlating query keywords and subject
headings) by sending queries to the OPAC and analyzing the results, and describe
how query expansion is done with the built thesaurus. Section 5 describes
our case study in which we deployed our approach in our local OPAC. We conclude
with a summary and directions for further work.

2 Query Expansion Techniques 

In query expansion, there are two key aspects: the source of expansion terms and 
the method to weight and integrate expansion terms [4, 3, 5]. Existing techniques 
can be classified as global, local or external, based on the source of terms. Global 
techniques require corpus-wide statistics such as the occurrence of expansion 
terms in the corpus and the source of expansion terms is usually the whole 
corpus. Local techniques analyze a number of top-ranked documents retrieved 
by a query to expand it. In contrast, external techniques depend on external 
resources such as domain-independent thesauri for expansion terms. 

2.1 Global Techniques 

We define a global technique as one that analyzes the contents of a particu- 
lar corpus to identify semantically similar terms. By gathering statistics of the 
co-occurrences of terms in the corpus, global techniques build statistical term 
relationships which can then be used to expand queries. Some global techniques 
are term clustering [6], global similarity thesauri [7], latent semantic indexing 
[8] and Phrasefinding [9]. Since global techniques focus only on the document 
and do not take into account the query, global techniques only offer a partial 
solution to the word mismatching problem [4]. Global techniques typically re- 
quire co-occurrence information for every pair of terms. However, most global 
techniques compute this information offline, removing a potential computational 
bottleneck. 
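A minimal Python sketch of the co-occurrence idea behind global techniques follows. It counts, for every pair of terms, the number of documents containing both, and uses those counts to suggest related terms. The toy corpus and function names are illustrative assumptions; the cited methods use more refined statistics than raw document co-occurrence.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_table(docs):
    """Count, for every term pair, the documents in which both occur."""
    table = defaultdict(Counter)
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            table[a][b] += 1
            table[b][a] += 1
    return table


def related_terms(term, table, k=2):
    """Terms that co-occur most often with `term` across the corpus."""
    ranked = sorted(table[term], key=lambda t: (-table[term][t], t))
    return ranked[:k]


corpus = [
    ["java", "programming", "language"],
    ["java", "programming", "applet"],
    ["java", "island", "indonesia"],
]
table = cooccurrence_table(corpus)
print(related_terms("java", table, k=1))  # ['programming']
```

The table is built once over the whole corpus, which matches the observation above that the pairwise statistics can be computed offline.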
2.2 Local Techniques

As compared to global techniques, local techniques use only a subset of the doc- 
uments retrieved by a query. Local techniques can be divided into two main 
categories: interactive (i.e., relevance feedback) and automatic (i.e., local feedback).



Relevance Feedback. In relevance feedback systems, related terms come from 
user-identified relevant documents. Relevance feedback was originally designed 
to be used with the vector space model [10]. However, relevance feedback has 
also been incorporated into Boolean retrieval models [11] and probabilistic
retrieval models [12, 13]. Other methods have also been proposed, such as
incremental relevance feedback [14], which analyzes previous queries and relevance
judgments in the same session to improve search effectiveness. Eguchi et al. [15]
proposed Adaptive Relevance Feedback (ARF) on top of incremental relevance feedback
to detect changes in users’ information goals. However, in a real search context, 
users are usually reluctant to provide any type of feedback [4, 16]. 

Local Feedback. Local feedback uses the top-ranked documents retrieved by 
a query as a viable source of information. The basic assumption is that the top- 
ranked documents retrieved are relevant, and thus the words in the top-ranked 
documents themselves can be used to expand the query. While the performance 
of local feedback can be erratic, it has shown good performance in Text REtrieval
Conference (TREC) experiments [17]. The TREC test collections are often used
to evaluate query expansion techniques [1-3,5]. 
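
The local feedback idea described above can be sketched in a few lines. This is a minimal illustration and not the procedure of any of the cited systems: the input format (documents as token lists, best-ranked first), the function name and the parameters top_n and num_terms are assumptions made for the example, and raw term frequency stands in for a proper term-weighting scheme.

```python
from collections import Counter

def local_feedback_expand(query_terms, ranked_docs, top_n=3, num_terms=2):
    """Expand a query with the most frequent terms of the top-ranked documents.

    ranked_docs is a list of token lists, best-ranked first (assumed format).
    """
    query = [t.lower() for t in query_terms]
    counts = Counter()
    # Basic assumption of local feedback: the top_n documents are relevant.
    for doc in ranked_docs[:top_n]:
        counts.update(t.lower() for t in doc)
    # Do not re-add terms that are already part of the query.
    for t in query:
        counts.pop(t, None)
    # Sort by descending frequency, then alphabetically for a stable result.
    candidates = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return query + [term for term, _ in candidates[:num_terms]]
```

If a non-relevant document reaches the top of the list, its terms are added as well, which is one way to picture the erratic behaviour mentioned above.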

Many improvements have been suggested to local feedback, such as using 
Boolean filters and proximity constraints to refine the set of top-ranked docu- 
ments [16], exploiting potentially relevant documents from past similar queries 
[18] and using information theory in weighting and selecting expansion terms 
[1]. Local context analysis [3, 5] was also proposed; it combines
ideas from global and local techniques, selecting expansion terms based on their
co-occurrences with the query terms within the top-ranked documents. More
recently, historical user logs have also been used to deduce likely user interactions [4].

2.3 External Techniques 

External techniques make use of external resources, such as online thesauri which 
are not tailored for any particular collection, for query expansion. In past work, 
general reference resources such as Longman’s Dictionary of Contemporary En- 
glish (LDOCE) and WordNet [19] have been used. Because of the ambiguity of 
terms and the existence of specialized terms for certain collections, these thesauri 
might be difficult to use. Voorhees [19] reported improvements of 1% for longer and
less ambiguous queries, but expanding shorter queries actually degraded performance.
Based on these neutral results, we have decided not to pursue the use of
general, external resources in our research. 

3 Building a Learning Thesaurus 

The Online Catalog Evaluation Projects reported that library patrons have prob- 
lems matching their terms with those indexed in the online catalog and do not 
understand the printed LCSH (Library of Congress Subject Headings) [20]. To 
solve this vocabulary problem, one possible solution is to map query terms to 
the underlying vocabulary of the corpus by building a corpus-specific thesaurus. 

To build such a thesaurus for an OPAC, frequently-occurring queries were 
first harvested from the historical query logs kept by the OPAC. Each of these queries
was re-sent to the OPAC, generating a ranked list of relevant documents. We
adopted local feedback to extract the top relevant documents, and extracted the 
subject headings for each document. We then mapped the query keywords to 
the frequency of the subject headings in the relevant documents. 
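
The construction steps above can be sketched as follows. This is a minimal sketch under stated assumptions: search is a hypothetical callable standing in for the OPAC interface, documents are assumed to be dictionaries with a "subjects" list, and the subject headings attached to the two sample titles are invented for illustration.

```python
from collections import Counter, defaultdict

def build_thesaurus(frequent_queries, search, top_n=10):
    """Map each query keyword to a Counter of subject-heading frequencies.

    search(query) is assumed to return documents ranked best-first, each
    represented as a dict with a "subjects" list.
    """
    thesaurus = defaultdict(Counter)
    for query in frequent_queries:
        top_docs = search(query)[:top_n]  # local feedback: trust the top hits
        for keyword in query.lower().split():
            for doc in top_docs:
                thesaurus[keyword].update(doc["subjects"])
    return thesaurus

# Hypothetical stand-in for the OPAC; the subject assignments are illustrative.
def fake_search(query):
    return [
        {"title": "Macromedia Flash MX: the complete reference",
         "subjects": ["Computer animation", "Web sites -- Design"]},
        {"title": "Macromedia Flash MX developer's guide",
         "subjects": ["Computer animation", "Internet programming"]},
    ]

thesaurus = build_thesaurus(["macromedia flash"], fake_search)
```

Each keyword of a query is credited with every subject heading of every top-ranked document, so a heading that appears in many of the retrieved documents accumulates a high frequency for that keyword.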

There are two design details that are important in our system’s architecture.
First, we used local feedback rather than standard relevance feedback because it
is fully automated: it requires no explicit relevance judgments, click-streams or
other user effort [15]. Second, we used frequently-occurring
queries in our historical query logs to build our thesaurus for the initial queries. 
Subsequent new queries submitted by users need to also be analyzed and the 
thesaurus updated to reflect the change in query patterns. 

3.1 Subject Headings 

Subject headings are usually assigned by expert cataloguers. These headings are 
used to index the documents in OPACs. Standardized, controlled vocabulary 
terms or subject headings are usually employed, such as the Library of Congress 
Subject Headings (LCSH), the National Library of Medicine’s Medical Subject
Headings (MeSH), or the Dewey Decimal Classification (DDC).

Compared to book titles, subject headings are more objective and precise. 
For example, subject headings “Genetics”, “Evolution (Biology)” and “Behav- 
ior genetics” are clearer than the title “The Selfish Gene”. Thus we feel that 
it would be less ambiguous to use the subject headings as expansion terms in 
comparison to book titles. Using subject headings also provides us with knowledge
from experts, which is less prone to errors, and eliminates the need to
use automatic term weighting algorithms, such as Term Frequency × Inverse
Document Frequency (TF×IDF), to extract terms from the corpus.

3.2 Correlating Query Terms and Document Terms 

In this study, we attempt to create links between keyword query terms and the 
subject headings documented in OPACs by the librarians and cataloguers. Our 
key observation is that if queries containing a certain term often lead to the 
selection of documents containing another term, then we consider that there is 
a strong relationship between these two terms [4]. 

We assumed that subject headings from the top documents retrieved using
queries containing a particular keyword were related to that keyword. For example,
the terms “macromedia” and “flash” in the query “macromedia flash” are
regarded as relevant to the documents that the query retrieved, e.g., “Macromedia
Flash MX developer’s guide”, and to the subject headings of these documents,
e.g., “Computer animation”. By acquiring and analyzing a large pool of queries and
collecting their top-ranked documents, we are able to form associations between
the queries and documents. These associations allow us to create a thesaurus
that will aid in query augmentation by mapping keywords in the queries to the
subject headings. Figure 1 shows how the subject headings are related to the
query keywords.

[Figure 1: the query keywords “macromedia” and “flash” are linked to the retrieved
documents (“Macromedia Flash MX: the complete reference”, “Macromedia Flash
super samurai”, “Macromedia Flash MX: training from the source”, “Macromedia
Flash MX developer’s guide”), which are in turn linked to the subject headings
“Computer Animation”, “Web Sites -- Design”, “Computer Graphics” and
“Internet Programming”.]

Fig. 1. Correlations between query keywords and subject headings.

4 A Corpus-Based Query Expansion Model 

4.1 Relation of a Document Term to the Entire Query 

Xu and Croft [5] and Qiu and Frei [7] hypothesized that relevant terms tend 
to co-occur with all query terms in the top-ranked documents. A similar idea is 
applied in this study; subject headings that occur with all or most of the query 
keywords in the thesaurus are considered more relevant than subject headings 
that only occur with a few keywords. In other words, we should consider a term 
that is similar to the query concept rather than one that is only similar to a 
single term in the query. 

Consider the queries Q1: “Java” and Q2: “Java Indonesia”. While Q1 is
ambiguous and could mean the programming language developed by Sun Microsystems,
the Indonesian capital island or the coffee bean, Q2 is much less
ambiguous and more likely to refer to the Indonesian capital island than to the 
other meanings. Subject headings that co-occur with both “Java” and “Indone- 
sia” are likely to be relevant to Q2 and should be given a higher weight than 
terms that only occur with “Java” or “Indonesia”. Terms should be selected 
based on their similarity to the entire query instead of just a few query terms. 
In contrast, many query expansion techniques add a term even when the term is 
only strongly related to just one of the query terms [7], resulting in suboptimal 
performance. As a result, our approach prefers subject headings that co-occur 
with more query terms over those co-occurring with fewer query terms. 

To determine the correlation between a query term w_q and a subject heading
w_d, we calculate the degree of co-occurrence of w_d and w_q. That is, we need to
calculate co-degree(w_d, w_q) [5]. We estimate co-degree(w_d, w_q) as the likelihood
that w_d and w_q co-occur non-randomly in the top-ranked documents retrieved
by queries containing w_q, using an equation adapted from the normalized
TF*IDF weighting scheme [21]:

\[ \text{co-degree}(w_d, w_q) = f(w_d) \times \ln \frac{m}{df(w_d)} \tag{1} \]

where f(w_d) is the frequency with which the subject heading w_d co-occurs with the
query keyword w_q in the thesaurus, m is the total number of distinct query
keywords and df(w_d) is the total number of distinct query keywords that co-occur
with w_d in the thesaurus. A higher f(w_d) indicates that the subject
heading w_d is more important than another subject heading with a lower f(w_d).
On the other hand, the higher df(w_d) is, the more likely it is that w_d co-occurs with
the query keyword w_q by chance or that w_d is ambiguous because it is
related to many different keywords.
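
As an illustration of Equation 1, the following sketch computes the co-degree from a thesaurus held in memory. The data layout (a dictionary mapping each query keyword to its subject-heading frequencies) and the sample counts are assumptions made for this example; only the formula itself comes from the text.

```python
import math

def co_degree(thesaurus, heading, keyword):
    """Equation 1: f(w_d) * ln(m / df(w_d)) for heading w_d and keyword w_q."""
    f = thesaurus.get(keyword, {}).get(heading, 0)
    if f == 0:
        return 0.0  # the heading never co-occurs with this keyword
    m = len(thesaurus)  # number of distinct query keywords
    # df: number of distinct keywords that co-occur with the heading
    df = sum(1 for headings in thesaurus.values() if headings.get(heading, 0) > 0)
    return f * math.log(m / df)

# Illustrative thesaurus with invented frequencies.
example = {
    "java": {"Java (Computer program language)": 8, "Indonesia": 2},
    "indonesia": {"Indonesia": 5},
    "coffee": {"Coffee": 4},
}
score = co_degree(example, "Indonesia", "java")
```

Here the heading “Indonesia” co-occurs with two of the three keywords, so its frequency of 2 is multiplied by ln(3/2). A heading that co-occurs with every keyword in the thesaurus has df(w_d) = m and therefore a co-degree of zero, which matches the remark that a high df(w_d) signals a chance or ambiguous association.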

The above calculates the relevance of the subject headings to individual
query terms. We also need to measure the relationship of the subject heading
with regard to the entire query. Many researchers [4, 5, 7] have proposed methods
to measure the degree of co-occurrence with all query terms. We measure
the relationship of a term to the entire query using the following cohesion weight
calculation [5]:



\[ g(w_d, w_q) = \prod_{\text{all query terms } w_q} \bigl(\delta + \text{co-degree}(w_d, w_q)\bigr) \tag{2} \]

where δ is a smoothing factor that assigns a small, non-zero value to subject
headings that co-occur with only one query term and would otherwise receive a
weight of zero. With a small δ, subject headings that co-occur with all query
terms are ranked higher and with a large δ, subject headings having significant
co-occurrences with individual query terms are ranked higher [5]. As we prefer
subject headings that co-occur with more query terms over those co-occurring
with fewer, we set the smoothing factor δ to a small value of 0.001.
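
The effect of the smoothing factor in Equation 2 can be seen in a small sketch. The function below takes the co-degree values of one subject heading with respect to each query term; the two sample headings and their co-degree values are invented for illustration.

```python
import math

def cohesion_weight(co_degrees, delta=0.001):
    """Equation 2: product over all query terms of (delta + co-degree)."""
    return math.prod(delta + c for c in co_degrees)

# Heading A co-occurs moderately with both query terms,
# heading B co-occurs strongly with only one of them (illustrative values).
heading_a = [2.0, 2.0]
heading_b = [10.0, 0.0]

small_delta = (cohesion_weight(heading_a, 0.001), cohesion_weight(heading_b, 0.001))
large_delta = (cohesion_weight(heading_a, 1.0), cohesion_weight(heading_b, 1.0))
```

With δ = 0.001, heading A outweighs heading B because the zero co-degree of B reduces its product to roughly 0.01. With δ = 1, the ordering flips (9 against 11), since the smoothing term now contributes as much as a genuine co-occurrence.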

After the cohesion weights of the subject headings related to the query have
been calculated and normalized, the subject headings for query expansion
have to be selected. We select the subject headings whose weights are above
the threshold β. The default value of β is 0.03. The new query Q’ is formulated
by adding these subject headings to the original query. Q’ is then used
to retrieve documents.
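
The selection step can be sketched as below. The text does not state how the weights are normalized, so this sketch assumes division by the sum of all weights; the function names and the sample weights are likewise hypothetical.

```python
def select_headings(weights, beta=0.03):
    """Return the headings whose normalized cohesion weight exceeds beta.

    Normalization by the sum of all weights is an assumption of this sketch.
    """
    total = sum(weights.values())
    if total == 0:
        return []
    normalized = {h: w / total for h, w in weights.items()}
    selected = [h for h, w in normalized.items() if w > beta]
    return sorted(selected, key=lambda h: -normalized[h])

def expand_query(query, weights, beta=0.03):
    """Form the new query Q' by appending the selected subject headings."""
    return query.split() + select_headings(weights, beta)

# Illustrative cohesion weights for the query "macromedia flash".
weights = {"Computer animation": 6.0, "Web sites -- Design": 3.9, "Cooking": 0.1}
new_query = expand_query("macromedia flash", weights)
```

In this example the heading “Cooking” has a normalized weight of about 0.01, which is below the default threshold of 0.03, so only the other two headings are added to the query.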

5 Experimental Evaluations 

In this section, we describe the methodology and data collection of the experi- 
ment before illustrating our experimental findings. 

5.1 Evaluation Methodology 

The objective of the experiments reported in this section is to test whether query 
expansion using the thesaurus we have constructed can be used to improve the 
retrieval effectiveness compared to the original (unexpanded) queries. Our local 
OPAC, an INNOPAC-based system, allows the user to select the method for 
sorting the results. The two most common methods are to sort by relevance or 
date; i.e. either the most relevant documents or the most recent documents are 
ranked higher. Thus, we compare our query expansion results using both date 
and relevance sorting methods. 

IR performance is usually assessed using the standard metrics of precision and
recall. However, Boolean retrieval (used in many OPACs) retrieves the same set of
documents with or without query expansion, and thus query expansion changes
only the ranking of the documents within the ranked list. As such, absolute
precision and recall are not suitable metrics. Instead, we use the precision of the
top k documents, or precision-at-k [1, 2, 22], as our performance metric.

We measured the precision over the first k documents for both our system and
the baseline method, sorted by date and relevance, where k = 12, 24, 36, 48 or
60, as our INNOPAC shows twelve documents per screen. The objective of this
experiment is to determine which solution retrieves more relevant documents
in the first k retrieved documents. The user model for this experiment is that
the user typically reads only the first k documents and not all the documents
[22]. In addition, users are usually more interested in the precision of the results
displayed in the first page of the list of retrieved documents [23]. The default
threshold β-value is set to 0.03 and the δ-value is set to 0.001 in this experiment.
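
For reference, precision-at-k can be computed as in the sketch below. The document identifiers and relevance judgments are invented, and dividing by k even when fewer than k documents are returned is an assumption of this sketch rather than something the text specifies.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the first k ranked documents that are judged relevant."""
    top = ranked_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    return hits / k

# Illustrative ranking of 24 documents and a set of relevant identifiers.
ranking = list(range(1, 25))
relevant = {1, 2, 3, 5, 8, 13, 21}

first_screen = precision_at_k(ranking, relevant, 12)
two_screens = precision_at_k(ranking, relevant, 24)
```

With twelve documents per screen, k = 12 measures the first screen only; five of its documents are relevant in this example, against seven of the first 24.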

To determine the optimal threshold β-value, we also tested the effect of using
different thresholds (β-values) for query expansion. We experimented with β-values
of 0.01, 0.03, 0.05 and 0.1, and we measured the precision over the first k
documents, where k = 12, 24, 36, 48 or 60. The δ-value is set to 0.001 for this
experiment. In addition, we also tested the effect of using different δ-values in
the cohesion weight Equation 2 to find the optimal δ-value in our cohesion
weight equation. For the δ-values of 0.001, 0.01, 0.05 and 0.1, we measured the
precision over the first k documents for query expansion, where k = 12, 24, 36, 48
or 60. We used the default β-value of 0.03 for this experiment.

5.2 Data Collection 

For our experiments, we collected queries from real OPAC users in the School of 
Computing at National University of Singapore (NUS) and at various discipline 
specific libraries. We conducted short interviews with the users to document their 
information needs. A total of 39 queries and their descriptions were collected. 
The users were requested to provide the query keywords they used, describe what 
they were searching for in detail, and identify topics that are likely to be relevant 
as well as topics that are likely to be irrelevant. Based on the descriptions given, 
we were able to judge the relevance of the documents. 

These queries had an average length of 2.05 words and cover various topics
from computer science to medicine. The experiments were conducted on the
heterogeneous NUS LINC corpus, which consisted of 1,209,509 unique titles as
at June 2003 (see footnote 2).

Table 1. Comparison of baseline and query expansion results. The percentages in
parentheses are the improvements over the date-sorted and relevance-sorted
baselines, respectively.

k        Baseline (Date)  Baseline (Relevance)  Query Expansion
12       0.4209           0.4209                0.5747 (+36.55%, +36.55%)
24       0.3536           0.3856                0.5128 (+45.02%, +32.96%)
36       0.3169           0.3589                0.4686 (+47.87%, +30.56%)
48       0.2932           0.3440                0.4375 (+49.18%, +27.17%)
60       0.2743           0.3192                0.4038 (+47.20%, +26.51%)
Average  0.3318           0.3657                0.4795 (+44.51%, +31.10%)

5.3 Experimental Results 

We now present the experimental results of query expansion on LINC, using
the metric discussed earlier, precision-at-k. The original (unexpanded) queries,
sorted by date and relevance, were used as the baseline in the experiments. The
results are presented in Table 1.

Our thesaurus-based query expansion performed very well compared to using
LINC without query expansion, improving the average precision-at-k by 44.51%
and 31.10% over the baselines sorted by date and relevance,
respectively. This suggests that our version of query expansion is indeed
useful in improving the retrieval effectiveness of the search. The reason for the 
improved performance is that some relevant documents which are ranked low by 
the original queries are propelled to the top of the ranked output because they 
contain many subject headings. In addition, query expansion was able to improve 
the retrieval performance of ambiguous queries. An example is the query “erp”, 
in which the user’s intention was to find books related to Enterprise Resource 
Planning (ERP) but some of the documents retrieved by the unexpanded query 
included terms such as expressway robbery permit and Event-Related Potentials, 
which were irrelevant to the user’s information needs. 

Determining the Threshold, β-Value. In Table 2, we list the retrieval performance
of query expansion using different β-values of 0.01, 0.03, 0.05 and 0.1.

Table 2 shows the effect of the β-value on the performance of query expansion.
We can see that the average precision-at-k tends to be slightly higher for β-values
of 0.03 and 0.05. When the β-value gets too large, potentially relevant
subject headings below the threshold are sometimes not selected. For
example, for the query “statistics”, the relevant subject heading “Mathematical
Statistics” was omitted. Thus, setting the β-value too high degrades retrieval
performance by omitting potentially relevant subject headings. On the other
hand, when the β-value is too small, irrelevant subject headings are selected. For
example, for the query “culture shock”, the irrelevant subject heading “Cell Culture”
was added. A small β-value allows irrelevant subject headings to be added,
which also decreases retrieval performance.

2 http://www.lib.nus.edu.sg/about/stats02-03.html

Table 2. Effect of threshold β-value on performance of query expansion.

k        β = 0.01  β = 0.03  β = 0.05  β = 0.1
12       0.5512    0.5747    0.5534    0.5427
24       0.4882    0.5138    0.4967    0.4679
36       0.4487    0.4707    0.4537    0.4301
48       0.4209    0.4380    0.4262    0.4059
60       0.3863    0.4038    0.3910    0.3756
Average  0.4591    0.4802    0.4642    0.4445



δ-Value. To find the optimal value for δ in the cohesion weight equation,
we measured the precision over the first k documents retrieved by our system.
We used different values for δ in the cohesion weight Equation 2 and compare the
results below.



Table 3. Effect of δ-value on performance of query expansion.

k        δ = 0.001  δ = 0.01  δ = 0.05  δ = 0.1
12       0.5747     0.5384    0.5128    0.4914
24       0.5128     0.4850    0.4529    0.4423
36       0.4686     0.4423    0.4166    0.4002
48       0.4375     0.4145    0.3931    0.3755
60       0.4038     0.3833    0.3662    0.3504
Average  0.4795     0.4527    0.4283    0.4120

Table 3 shows the effect of the δ-value on the performance of query expansion. We
can see that the average precision-at-k tends to decrease as the δ-value increases.
Using a δ-value of 0.001 as the baseline, average precision-at-k fell by 5.59%,
10.67% and 14.08% when the δ-value increased to 0.01, 0.05 and 0.1 respectively.
This is because when the δ-value gets too large, it dominates the cohesion weight
equation that we discussed earlier, making the more crucial co-degree factor less
important. The cohesion weights of the subject headings then become inaccurate,
which often causes relevant subject headings to be omitted. To illustrate,
the relevant subject heading “Evolution (Biology)” was omitted for the query
“evolution” and for the query “C”, the relevant subject heading “C (Computer
Program Language)” was omitted. If a small δ-value is used, subject headings
co-occurring with more terms are given heavier weights. Xu and Croft [5] mentioned
that “concepts co-occurring with all query terms are good for precision”. Our
experimental results also imply that the δ-value should not be too large, as it is only
a smoothing factor and should not dominate the cohesion weight equation.
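
The relative decreases quoted above can be recomputed from the averages in Table 3, as in the following sketch. Because the table reports averages rounded to four decimals, the recomputed percentages may differ from the quoted ones in the last digit.

```python
def relative_change(baseline, value):
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0

# Average precision-at-k per delta-value, copied from Table 3.
averages = {0.001: 0.4795, 0.01: 0.4527, 0.05: 0.4283, 0.1: 0.4120}

for delta, average in averages.items():
    change = relative_change(averages[0.001], average)
    print(f"delta = {delta}: {change:+.2f}%")
```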

6 Conclusion 

We proposed a method for automatic query expansion in OPACs based on the 
domain knowledge contained in an automatically constructed thesaurus which 
maps query keywords to document subject headings. To build this thesaurus, 
historical query logs were analyzed to find out the most frequently-occurring 
queries and keywords library patrons use. Document terms were then extracted 
from the top-ranked documents retrieved by these queries and statistical cor- 
relations between query keywords and document subject headings were created 
to support query expansion. The experimental results show that our solution is 
practical and offers significantly better performance than the unexpanded base- 
line. Precision on the first screen of ranked documents improved by over 30% in 
our experiments. 

Although we have successfully incorporated our query expansion technique 
into a widely-used OPAC and demonstrated its effectiveness, there is still much 
room for improvement. In our future work, we plan to investigate how phrase
structure can refine the terms collected in our OPAC-specific thesaurus. In
addition, we are exploring how document metadata, such as MARC metadata, can
be harnessed for further query expansion.



Acknowledgments 

We wish to thank our colleagues over at NUS Libraries for their generous contri- 
bution of the LINC query logs for our research use and their continued support 
of our on-going work to improve OPAC usability. We also would like to thank 
the anonymous reviewers for their helpful suggestions. 



References 

1. Carpineto, C., De Mori, R., Romano, G., Bigi, B.: An information-theoretic approach
to automatic query expansion. ACM Transactions on Information Systems
19 (2001) 1-27

2. Carpineto, C., Romano, G., Giannini, V.: Improving retrieval feedback with multiple
term-ranking function combination. ACM Transactions on Information Systems
20 (2001) 259-290

3. Xu, J., Croft, W.: Query expansion using local and global document analysis. In:
Proceedings of the 19th Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval, (SIGIR ’96), Zurich, Switzerland, ACM
Press (1996) 4-11

4. Cui, H., Wen, J.-R., Nie, J.Y., Ma, W.Y.: Query expansion by mining user logs.
IEEE Transactions on Knowledge and Data Engineering 15 (2003) 829-839 

5. Xu, J., Croft, W.: Improving the effectiveness of information retrieval with local 
context analysis. ACM Transactions on Information Systems 18 (2000) 79-112 

6. Sparck Jones, K.: Automatic Keyword Classification for Information Retrieval. 
Butterworths, London, UK (1971) 

7. Qiu, Y., Frei, H.: Concept based query expansion. In: Proceedings of the 16th 
Annual International ACM SIGIR Conference on Research and Development in 
Information Retrieval, (SIGIR ’93), Pittsburgh, USA, ACM Press (1993) 160-169 

8. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput- 
ing Surveys 34 (2002) 1-47 







9. Jing, Y., Croft, W.: An association thesaurus for information retrieval. In: Pro- 
ceedings of the Intelligent Multimedia Information Retrieval Systems. (RIAO ’94), 
New York, USA (1994) 146-160 

10. Salton, G., Buckley, C.: Improving retrieval performance by relevance feedback. Journal
of the American Society for Information Science and Technology 41 (1990) 288-296 

11. Radecki, T.: Incorporation of relevance feedback into boolean retrieval systems. 
In: Proceedings of the 5th Annual ACM Conference on Research and Development 
in Information Retrieval, West Berlin, Germany, ACM Press (1982) 133-150 

12. Robertson, S.E., Sparck Jones, K.: Relevance weighting of search terms. Journal 
of the American Society for Information Science 27 (1976) 129-146 

13. Sparck Jones, K.: Search term relevance weighting given little relevance informa- 
tion. Journal of Documentation 35 (1979) 30-48 

14. Aalbersberg, I.: Incremental relevance feedback. In: Proceedings of the 15th Annual
International ACM SIGIR Conference on Research and Development in Information
Retrieval, Copenhagen, Denmark (1992) 11-22

15. Eguchi, K., Ito, H., Kumamoto, A., Kanata, Y.: Adaptive and incremental query expansion
for cluster-based browsing. In: Proceedings of the 6th International Conference on
Database Systems for Advanced Applications, (DASFAA ’99), Hsinchu, Taiwan,
IEEE Computer Society (1999) 25-34

16. Mitra, M., Singhal, A., Buckley, C.: Improving automatic query expansion. In:
Proceedings of the 21st Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, (SIGIR ’98), Melbourne, Australia, ACM
Press (1998) 275-281

17. Voorhees, E.M., Harman, D.: Overview of the 6th text retrieval conference (trec-6). 
In: Proceedings of the 6th Text Retrieval Conference (TREC-6). Number 500-240 
in NIST Special Publication (1998) 

18. Fitzpatrick, L., Dent, M.: Automatic feedback using past queries: Social search- 
ing? In: Proceedings of the 20th Annual International ACM SIGIR Conference on 
Research and Development in Information Retrieval, (SIGIR 1997), Philadelphia, 
USA, ACM Press (1997) 306-313 

19. Voorhees, E.M.: Query expansion using lexical-semantic relations. In: Proceedings 
of the 17th Annual International ACM SIGIR Conference on Research and Devel- 
opment in Information Retrieval, (SIGIR ’94), Dublin, Ireland, ACM Press (1994) 
61-69 

20. Markey, K.: Subject searching in library catalogs: Before and after the introduction 
of online catalogs. Number 4 in OCLC Library, Information and Computer Science 
Series. OCLC Online Computer Library Center, Dublin, Ohio (1984) 

21. Wu, H., Salton, G.: A comparison of search term weighting: term relevance vs. in- 
verse document frequency. In: Proceedings of the 4th Annual International ACM 
SIGIR Conference on Information storage and retrieval: theoretical issues in infor- 
mation retrieval, Oakland, California, ACM Press (1981) 30-39 

22. Davis, E.: Web search engines: Retrieval.
http://wot.cs.nyu.edu/courses/fall02/G22.3033-008/lec5.html (2002)

23. Kobayashi, M., Takeda, K.: Information retrieval on the web. ACM Computing 
Surveys 32 (2000) 144-173 




Automated Indexing with Restricted Random 
Walks on Large Document Sets 



Markus Franke and Andreas Geyer-Schulz 

Institut für Informationswirtschaft und -management
Universität Karlsruhe (TH), 76128 Karlsruhe, Germany

{maf, ags}@em.uni-karlsruhe.de



Abstract. We propose a method based on restricted random walk clus- 
tering as a (semi-)automated complement for the tedious, error-prone 
and expensive task of manual indexing in a scientific library. The first 
stage of our method is to cluster a set of (partially) indexed documents 
using restricted random walks on usage histories in order to find groups 
of similar documents. In the second stage, we derive possible keywords 
for documents without indexing information from the frequencies of key- 
words assigned to other documents in their respective cluster. 

Due to the specific clustering algorithm, the proposed algorithm is still 
efficient with millions of documents and can be deployed on standard PC 
hardware. 



1 Motivation and Introduction 

As of today, the index quality of catalogues in scientific libraries is deplorable: 
Large parts of the inventory are not indexed and will probably never be, since 
manual indexing is a time-consuming and thus expensive task. For instance, in 
a sample of 38720 documents drawn at random from the Online Public Access 
Catalogue (OPAC) of the Universitätsbibliothek at Karlsruhe University (TH),
11594 (approximately 30%) had no keyword, although the library has the reputation
of having the best catalogue in Germany. This problem has to be faced
by most libraries today, whether they are conventional or digital. 

Given the dimensions of the network catalogue of the Südwestdeutsche
Bibliotheksverbund (SWB) hosted at Karlsruhe with approximately 15 million
documents, a manual post-editing of the missing classifications cannot be financed.
On the other hand, proper index information is a crucial condition for scientific 
work with the library’s literature. However, for the use in our scenario, the meth- 
ods sketched in section 2 that perform a classification based on the content of 
the library’s documents cannot be applied since no digital representation exists 
for the major part of the documents in a conventional and even a hybrid library 
like the one at Karlsruhe. We therefore had to resort to the only data available, 
which means in this case three years’ worth of usage histories of the documents, 
gathered from the log file of the library’s web interface. These usage histories
describe the course of user sessions with the interface and contain identifiers for 
all documents viewed. 



R. Heery and L. Lyon (Eds.): ECDL 2004, LNCS 3232, pp. 232—243, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 

On the basis of the usage histories, a recommender system based on Ehrenberg’s
repeat buying theory [1] is already operative at Karlsruhe [2] that returns,
for a given document, documents that are similar to this one. The recommender 
system is based on the assumption that, if two documents have been inspected 
by a user in one session with the OPAC, there is a high probability that they 
are complementary. The user acceptance of this recommendation service has 
been consistently high; details on the evaluation can be found in [3]. For scientific
libraries, complementarity of documents frequently follows from the highly
specialized structure of research enquiries. In addition, research enquiries usually
aim at getting a survey of the state of the art in a very limited field. This
property supports the assumption that cross-occurrences may serve as a good
predictor of similarity. Commercial recommender systems such as amazon.com’s
are usually hybrids which allow product managers to introduce biases into rec- 
ommendations. The exact information on the frequency of bias is not available 
for the general public. 
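
The session-based assumption can be illustrated with a small sketch that counts how often two documents are inspected in the same session. It deliberately omits the statistical filtering of Ehrenberg’s repeat buying theory used by the recommender in [2]; the session format, the function names and the document identifiers are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

def co_occurrence_counts(sessions):
    """Count, for every pair of documents, the sessions containing both."""
    counts = Counter()
    for session in sessions:
        # Each document counts once per session, whatever the viewing order.
        for a, b in combinations(sorted(set(session)), 2):
            counts[(a, b)] += 1
    return counts

def recommend(doc_id, counts, top_n=3):
    """Return the documents most often viewed together with doc_id."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == doc_id:
            related[b] += n
        elif b == doc_id:
            related[a] += n
    ranked = sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]

# Illustrative usage histories: one list of document identifiers per session.
sessions = [["d1", "d2", "d3"], ["d1", "d2"], ["d2", "d4"]]
counts = co_occurrence_counts(sessions)
suggestions = recommend("d1", counts)
```

Raw counts of this kind are also the sort of similarity information from which a clustering of the documents can start.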

Departing from this idea, we perform a clustering of the documents based on 
usage histories so as to determine the distribution of keywords in document clus- 
ters. From these distributions we derive recommendations for further associations 
between documents and keywords that can be used by the library’s personnel as 
a suggestion for indexing or even generate these associations automatically. 

This paper comprises the following sections: After the introduction, section 
two will briefly present common approaches to automated document classifica- 
tion. Section three describes the basic restricted random walk clustering algo- 
rithm that enables us to cluster such large document sets efficiently while section 
four presents the derivation of keywords from the clusters. In section five, we will 
discuss the results, especially the quality of the generated keyword suggestions 
before concluding the paper with section six. 

2 Existing Approaches 

Presently, the mainstream methods for automated indexing or classification of
documents are based on some kind of content analysis of a digital full text or
abstract. Yang [4] gives an overview and a comparison of statistical approaches
to text classification. Among these, the idea of kNN indexing (k nearest
neighbors) is quite close to our approach in some aspects: While we use
restricted random walk clustering on usage histories in order to determine
similar documents, kNN is based on a search for the k nearest neighbors in terms
of a textual similarity measure. The common aspect is the actual generation of
indexing information: Potential classifications are derived from the neighboring
documents’ classifications, possibly weighted with their respective similarity
to the first document in the case of kNN. The kNN approach has been tested as a
specific instance of memory-based reasoning (MBR) by Creecy et al. [5] on
answers to the questionnaire of the 1990 Decennial Census in the United States.
However, for this specific application setting, only a relatively low precision
of 60% could be achieved.

Sebastiani [6] gives a survey of variants of the currently predominant
approach, namely machine learning (ML), which has replaced the knowledge
engineering paradigm of the 1980s. While the latter relies on the encoding of
expert knowledge in a set of rules, machine learning algorithms acquire their
knowledge from a training data set and generalize the insights thus gained to
the “real” data. In this context, neural networks [7] and support vector
machines [8] should be mentioned.

Among the ML-based solutions, there exist some that take into account not 
only the textual information, but also the (graphical) context of the text and its 
layout [9]. 

However, the basic principle of all these techniques is to use intrinsic infor- 
mation from the documents themselves. This aspect differentiates our idea from 
the ones cited above in that we cannot resort to digital full texts in a hybrid 
library like Karlsruhe. We use external data on documents instead, namely usage 
histories, and derive similarity information from them. Consequently, we are not 
dependent on a digital representation of the documents to be classified; simple
usage statistics that can be gathered from the IT systems present at most of 
today’s libraries at little cost are sufficient for generating indices of at least com- 
parable quality. In a digital library, it is even easier to gather the usage histories, 
since the whole process of information search, purchase and delivery is embed- 
ded in a digital process. Of course, nothing precludes a treatment of the digital 
documents with sophisticated content-analysis methods in a second stage. 

3 The Cluster Method 

A comprehensive survey of commonly used cluster algorithms is given in [10-12]. 

The library of the Karlsruhe University (TH) offers access to bibliographic
data of about 15 million documents available in the SWB. Clearly, conventional
cluster algorithms cannot be applied to a data set of this size, since these
methods have a superlinear time complexity, so that clustering of very large
data sets becomes computationally intractable. For the single linkage clustering
algorithm, this has been shown by way of example by Viegener [13]. The method
that we propose in this paper is based on the work of Scholl and Paschinger [14]
and has been adapted for the specific environment of library usage histories
[15]. It has the advantage of a linear time complexity while producing results
of a high precision.

The actual clustering algorithm can be divided into two stages, the walk 
stage and the cluster construction stage. These are followed by the evaluation of 
the clusters in order to gain keyword information. 

The Walk Stage 

The fundamental principle of clustering with restricted random walks (RRW) as 
proposed in [14] is to execute a series of random walks on a set of objects given a 
complete distance measure between them in such a way that, with growing length 
of the walk, only closer and closer objects are chosen. The walk terminates when 
there is no object closer to the current one than the previous one. 

In the context of library usage data, we change the perspective: Instead of
distances, we consider similarity measures, which is the more natural approach
for our application. Of course, this transformation does not imply any
substantial changes; a strictly monotonic transformation suffices. As a
similarity measure, we use the co-usage or cross-occurrence frequency, which is
defined as follows:

Definition 1. When a user browses the detail page of a document in the
library’s WWW OPAC interface, this is a purchase occasion for this document.
If, in the course of the session, the user browses the detail page of another
document, this is a cross-occurrence between the two.

In order to obtain a similarity measure, we extract the cross-occurrence
frequencies from the WWW server’s log files and store them in so-called raw
baskets. A raw basket exists for every document that has been viewed at least
once together with another document. It contains all its cross-occurrences with
other documents as well as the respective frequencies. The similarity measure
$s(i,j)$ for two documents $i$ and $j$ having usage histories is defined as the
absolute frequency of their cross-occurrence. The self-similarity $s(i,i)$ can
be defined arbitrarily; it is not used by the algorithm.
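
As an illustration of the raw baskets described above, the following is a minimal sketch, not the implementation used in the study. It assumes that the log files have already been split into sessions, each given as a list of viewed document identifiers; the function names `build_raw_baskets` and `similarity` and the toy sessions are hypothetical, and counting each pair at most once per session is our assumption.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_raw_baskets(sessions):
    """Return {document: Counter({other document: cross-occurrence frequency})}."""
    baskets = defaultdict(Counter)
    for viewed in sessions:
        # Assumption: every unordered pair of distinct documents viewed in
        # one session counts as exactly one cross-occurrence.
        for i, j in combinations(sorted(set(viewed)), 2):
            baskets[i][j] += 1
            baskets[j][i] += 1
    return baskets


def similarity(baskets, i, j):
    """s(i, j): absolute cross-occurrence frequency, 0 if never viewed together."""
    return baskets[i][j] if i in baskets else 0


# Hypothetical sessions; document "D" is only ever viewed alone.
sessions = [["A", "C", "G"], ["A", "C"], ["B", "F"], ["D"]]
baskets = build_raw_baskets(sessions)
```

A document that was never viewed together with another one, such as "D" here, receives no raw basket, in line with the description above.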

This interpretation of cross-occurrences is justified by the intuition that
products (documents in this case) that are frequently bought together will in
general have a high complementarity. In the context of scientific research,
complementarity of information products often relates to similarity with
respect to topic. This fact is used, for example, by market basket analysis in
marketing.

We construct a weighted similarity graph $G = (V, E, \omega)$ as input for the
algorithm. Since we work on a document set, the set of vertices $V$ is naturally
given as the set of documents available in the SWB that have a usage history.

For the edges and the weights, we use the similarity information implied by
the usage histories of the documents: $E$, the set of edges, contains an edge
between each pair of documents with a positive similarity. The weights
$\omega_{ij}$ on the edges are set to the respective similarity $s(i,j)$ between
the vertices they connect. Formally, the similarity graph thus constructed can
be written as $G = (V, E, \omega)$ with
$E = \{(i,j) \in V \times V \mid s(i,j) > 0,\ i \neq j\}$ and
$\omega_{ij} = s(i,j)$.
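
The construction of the similarity graph can be sketched as follows. This is only an illustration under the assumption that the similarities are available as a dictionary from document pairs to cross-occurrence frequencies; the function name `build_similarity_graph` and the sample values are hypothetical.

```python
def build_similarity_graph(s):
    """Build an adjacency map graph[i][j] = weight of edge (i, j).

    `s` maps pairs (i, j) to the similarity s(i, j). Only pairs with
    positive similarity and i != j become edges; the graph is undirected.
    """
    graph = {}
    for (i, j), weight in s.items():
        if i == j or weight <= 0:
            continue  # self-similarity is not used; zero similarity gives no edge
        graph.setdefault(i, {})[j] = weight
        graph.setdefault(j, {})[i] = weight
    return graph


# Hypothetical similarities, including a self-similarity and a zero entry.
s = {("A", "C"): 6, ("C", "G"): 7, ("A", "A"): 5, ("A", "B"): 0}
graph = build_similarity_graph(s)
```

Documents that end up without any edge do not appear as keys, which mirrors the restriction of $V$ to documents that have a usage history.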

We begin the walk by picking a start node $i_0$ from $V$. For this node, we
define the set of possible successors as the neighbors of $i_0$:

$$T_0 = \{ j \in V \mid (i_0, j) \in E \} \qquad (1)$$

The second node $i_1$ is chosen with equal probability from $T_0$. We store the
similarity between these nodes as the step width $s_1 = s(i_0, i_1)$. For each
further step, the following restriction is added, based on the step width of
the last step. This last step width is used as a minimum requirement for the
similarity of the documents participating in the following steps. Generally, in
the $m$-th step, $m \geq 1$, the node $i_{m+1}$ is picked from the set

$$T_m = \{ j \in V \mid (i_m, j) \in E,\ s(i_m, j) > s_m \} \qquad (2)$$

with equal probability. The step width is updated accordingly:
$s_{m+1} = s(i_m, i_{m+1})$. These iterations are repeated until $T_m$ is empty.
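
The walk stage can be illustrated by the following minimal sketch. It assumes the similarity graph is held in memory as an adjacency map `graph[i][j] = s(i, j)`; the function name `restricted_random_walk` and the toy graph are hypothetical and only the weights $s(A,C) = 6$ and $s(C,G) = 7$ are taken from the example in this paper.

```python
import random


def restricted_random_walk(graph, start, rng=random):
    """Perform one restricted random walk and return the visited nodes in order."""
    walk = [start]
    current = start
    step_width = None  # no restriction applies to the very first step
    while True:
        # Successor set: neighbors whose similarity strictly exceeds
        # the width of the previous step.
        candidates = [
            j
            for j, weight in graph.get(current, {}).items()
            if step_width is None or weight > step_width
        ]
        if not candidates:
            return walk  # the successor set is empty, so the walk ends
        successor = rng.choice(candidates)  # uniform choice among candidates
        step_width = graph[current][successor]
        walk.append(successor)
        current = successor


# Toy graph (hypothetical apart from the two weights named above).
graph = {"A": {"C": 6}, "C": {"A": 6, "G": 7}, "G": {"C": 7}}
walk = restricted_random_walk(graph, "A")
```

Because the step width grows strictly with every step and the graph is finite, each walk terminates.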

The walks are very short compared to the size $n$ of the object set; Scholl
and Paschinger give an estimate of $O(\log n)$. Since this is not sufficient to
cover the whole document set, several walks must be started. We followed the
approach in [14] and started a walk from each node. Results from random graph
theory [16], however, suggest that walks comprising a total number of
$\frac{1}{2} n \log n + 10n$ randomly chosen edges, that is
$O\left(\frac{n}{2} + \frac{10n}{\log n}\right)$ walks, are sufficient to cover
the object set with a 99.995% probability.



The Cluster Construction Stage 

Clustering with restricted random walks generates a (quasi-)hierarchical cluster- 
ing of the documents. Graphically, the result can be represented by a dendro- 
gram. In order to obtain a grouping of the data, it is sufficient to choose a cutoff 
parameter l, that is a height at which we make a “cut” through the dendrogram. 
Two possible methods exist for the construction of clusters from the data of the 
walk stage: Component clusters and the walk context. 

1. The original method developed by Scholl and Paschinger [14] is that of
component clusters. From the step data of the walk, a series of graphs
$G_k = (V, E_k)$ is constructed, with $V$ the set of objects having a usage
history. For every pair of objects that has formed the $k$-th step of any walk,
$E_k$ contains an edge between these two vertices. For each cutoff $l$, the union

$$H_l = \bigcup_{k=l}^{\infty} G_k \qquad (3)$$

is constructed. On this structure, Scholl and Paschinger define clusters as
components (connected subgraphs) of $H_l$. Thus, at level $l$, two documents
are in the same cluster if and only if there is a path in $H_l$ between them.
Unfortunately, this variant of the cluster construction stage has proven to 
return results that are much too large with respect to the aim of our appli- 
cation [15]. This is due to so-called bridge elements, i.e. documents that are 
frequently viewed together with documents from different domains. For ex- 
ample, a book on statistics might be viewed by economists and sociologists 
alike, thus building a “bridge” in the data that allows the restricted ran- 
dom walk to still change between clusters at a late stage of the walk. If this 
happens, the two clusters are merged by the concept of component clusters, 
even though they only share one single link. The effect is known in clustering 
literature as chaining, and while it is much weaker with restricted random 
walks than with single linkage clustering [13], it is still too pronounced for 
the results to be usable for indexing. 

2. Therefore, we developed the concept of the walk context. The intuition is that
if the histories of the walks are taken into account, the probability of merging
clusters completely is much lower. If a walk crosses cluster boundaries, the
effect is much more limited.

Furthermore, we introduced the step level: Since the discriminatory power
of the step numbering as used in the component cluster variant is blurred
when walks of strongly differing lengths are present in the data, we added a
normalization and defined the level of a step as the ratio
$\frac{\text{step number}}{\text{walk length}}$. With this, we stress the
relative position of a step in a walk rather than its absolute value. For
instance, step number two in a two-step walk is important, whereas it would be
quite insignificant in a walk comprising 25 steps, which is reflected in the
respective levels of 1 and 0.08.

For the walk context of a document $i$ at a given level $l$, we consider all
walks containing $i$ in a step with a level greater than or equal to $l$. The
walk context is composed of all documents contained in steps of these walks
also at a level greater than or equal to $l$. The cluster is the union of the
walk context with $i$.
The walk context variant has proven to produce clusters with a consider- 
ably better precision for our specific setting than the component cluster 
variant [15]. 
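
The step level normalization can be sketched as follows. This is an illustrative helper only; the name `step_levels` and the representation of a walk as a sequence of node identifiers are our assumptions.

```python
def step_levels(walk):
    """Return [((u, v), level)] for each step of a walk.

    The level of the k-th step is k divided by the number of steps in the
    walk, so the last step of every walk has level 1.
    """
    n_steps = len(walk) - 1
    return [
        ((walk[k - 1], walk[k]), k / n_steps)
        for k in range(1, n_steps + 1)
    ]


# Step number two in a two-step walk versus in a 25-step walk.
short_walk = ["A", "C", "G"]
long_walk = list(range(26))  # 26 nodes, i.e. 25 steps (hypothetical identifiers)
level_in_short = step_levels(short_walk)[1][1]
level_in_long = step_levels(long_walk)[1][1]
```

With these inputs, the second step receives the levels 1 and 0.08 respectively, as stated in the text.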

For an example, consider Fig. 1 depicting a section of a similarity graph. We 
will construct clusters for the node A of the graph. Edges contained in a set will 
always be alphabetically ordered. The dotted lines depict edges to nodes that 
belong to the graph, but do not participate in the example. 

We start our first walk at node $i_0 = A$. The set $T_0 = \{B, C, D\}$ contains
the possible successors, from which we pick one at random, say $i_1 = C$. The
step width is set to $s_1 = s(A, C) = 6$. The set $T_1$ is composed of all nodes
neighboring $C$ and having a similarity greater than 6, so $T_1 = \{G\}$. Since
only $G$ is left as successor, we pick it and update $s_2 = s(C, G) = 7$. Now
$T_2$ is empty, so the walk ACG ends. We assign the step levels 0.5 and 1 to
the steps AC and CG. Similarly, we might get the walks BGHCA, CBAD, DA, EDA,
FBCA, GHCA and HCA.

Using the component cluster approach, we obtain the series $G_k = (V, E_k)$
with $E_1 = \{AC, AD, BC, BF, BG, CH, DE, GH\}$ for the nodes participating in
the first step and $E_2 = \{AB, AC, AD, BC, CG, CH, GH\}$,
$E_3 = \{AC, AD, CH\}$, $E_4 = \{AC\}$.

This leads to the graphs
$H_1 = (V, \{AB, AC, AD, BC, BF, BG, CG, CH, DE, GH\})$,
$H_2 = (V, \{AB, AC, AD, BC, CG, CH, GH\})$, $H_3 = (V, \{AC, AD, CH\})$ and
$H_4 = (V, \{AC\})$. A cluster at level $l = 2$ for node A is the set
$\{A, B, C, D, G, H\}$.

With our approach, the walk context for A at level l = 0.6 is the set 
{B, C, D, H}, so consequently, the cluster is the set {A, B, C, D, H} from the 
walks BGHCA, CBAD, DA, EDA, FBCA, GHCA and HCA. 
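
Both cluster construction variants can be recomputed from the eight walks listed in the example. The following is a minimal sketch rather than the implementation used in the study; the function names are hypothetical, and walks are represented as strings of single-letter node names for brevity.

```python
from collections import defaultdict

WALKS = ["ACG", "BGHCA", "CBAD", "DA", "EDA", "FBCA", "GHCA", "HCA"]


def steps(walk):
    """Yield (step number k, step level, (u, v)) for every step of a walk."""
    n_steps = len(walk) - 1
    for k in range(1, n_steps + 1):
        yield k, k / n_steps, (walk[k - 1], walk[k])


def component_cluster(walks, node, cutoff):
    """Connected component of `node` in the union of all G_k with k >= cutoff."""
    adjacency = defaultdict(set)
    for walk in walks:
        for k, _, (u, v) in steps(walk):
            if k >= cutoff:
                adjacency[u].add(v)
                adjacency[v].add(u)
    cluster, frontier = {node}, [node]
    while frontier:  # plain graph search over the selected edges
        u = frontier.pop()
        for v in adjacency[u] - cluster:
            cluster.add(v)
            frontier.append(v)
    return cluster


def walk_context_cluster(walks, node, level):
    """Walk context of `node` at `level`, united with `node` itself."""
    cluster = {node}
    for walk in walks:
        high = [edge for _, lvl, edge in steps(walk) if lvl >= level]
        # Only walks that contain `node` in a sufficiently high step count.
        if any(node in edge for edge in high):
            for edge in high:
                cluster.update(edge)
    return cluster


component = component_cluster(WALKS, "A", 2)
context = walk_context_cluster(WALKS, "A", 0.6)
```

For node A, the component cluster at level 2 still contains G, which is reached only via C, whereas the walk context cluster at level 0.6 does not.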

There exists an alternative formulation for the restricted random walk on the 
edges of the graph that allows an analysis of the random process with the tools 
of Markov chain theory. Details can be found in [15]. The results are the same 
for both formulations. 

The complexity of the cluster generation with restricted random walks normally
is in the order of $O(n \log n)$ [14], since a walk has an expected length of
$O(\log n)$ and $O(n)$ walks are executed. However, for our scenario, we
conjecture that the neighborhood size is bounded by a constant [15], so that the
algorithm has a linear complexity, which allows for an efficient execution on
large data sets.

4 Indexing with RRW Clusters 

Our goal is to develop a system that is able to complement the indexing
information of documents in a scientific library. It should work both with
libraries that are in transition to a digital library and with completely
digital libraries. For
our example application, we consider the hybrid library at Karlsruhe that offers 
a mix between conventional and digital services, for instance access to electronic 
journals, electronic publications or a digital document delivery service. The in- 
dexing system at this library is based on the classification scheme of the SWD 
Sachgruppen devised by the Deutsche Bibliothek [17]. This scheme offers four 
levels of classification, of which we only considered the first two due to the sparse 
classification mentioned in the introduction. Each of the categories has a numeri- 
cal classifier like 13.5, and a textual one, the keyword, like “PHOTOGRAPHY”. 
In total, the two-level classification scheme comprises 213 classes. 

Classification methods for documents can be categorized by several criteria
used in the literature; for an example, consider Sebastiani [6] with the
categories single- vs. multilabel, category- vs. document-pivoted and hard vs.
ranking.

Our system can assign multiple keywords per document; it is document-pivoted
(this means that we try to find all keywords that match a given document) and
gives a “hard” categorization, i.e. a binary decision whether a keyword should
be assigned to a document or not. A ranking of keywords and a degree or
probability of fit could be generated by our application; however, this is not
supported by the indexing system of the library and thus will not be considered.

With the clusters derived from clustering the usage histories, we proceed to
use their inherent similarity information for automated indexing. The general
idea is that clustering based on usage histories reveals semantic similarity
between documents without the need for any analysis of the documents’ content.
This is quite contrary to conventional approaches that have their foundations in
information retrieval and try to gain insights into possible classifications by
analyzing the text itself, for example by evaluating the distribution of word
frequencies. Clearly, this approach is not feasible in a library where no
digital full-text information exists for many documents, and it calls for very
sophisticated and efficient methods when categorizing large sets of digital
full texts.

Thus, analogously to ideas from recommender systems, the textual analysis 
is substituted by an analysis of usage histories. We conclude that, if a document 
j is contained in a cluster for a document i, the two have a high similarity. In 
a second step, when we consider the keyword frequencies assigned to j and the 
other documents in i’s cluster, these should also fit for i due to the similarity 
of the documents in the cluster to i. Consequently, if the cluster for a given 
document contains only documents having a certain keyword, the probability is 
very high that the keyword also fits the first document. 

As an example, consider the keywords derived from an example cluster be- 
longing to the book “UNIX system administration handbook” by Evi Nemeth 
at level 0.5: Of the 14 documents in this cluster, all 14 were associated with the 
index 30: “Computer Science and Data Processing” (There are no more specific 
sub-topics under this one), one had the index 31.9: “Electrical Engineering” and 
another one 10.11: “Business Administration”. 

In an earlier work [15] we scrutinized the performance of the clustering al- 
gorithm on large data sets. General information about the quality of the RRW 
method can be found in [14]. In this paper, we concentrate on the quality of 
keyword recommendations obtained from the clusters. 

We start by gathering indexing information about the documents having a 
usage history. Since the library does not offer direct access to their catalogue 
data, we query the WWW interface that returns an HTML page with embedded 
MAB (an electronic data exchange format devised by the Deutsche Bibliothek 
[18]) information. We extract the keywords as well as the document name and 
author and store them in a relational database. 

In the next step all documents are selected from this database that do not 
have any keywords associated as potential candidates for automated indexing. 
We then generate a set of clusters at different levels l for each of these documents. 
Within each cluster, we query the database for the keywords assigned to the 
documents in the respective cluster and count their frequencies. As a result, we
obtain $t_l(i)$, the total number of documents in the cluster for document $i$
at level $l$ that have any keyword, and, for a keyword $k$, the number of
documents in the cluster it has been assigned to, denoted by $f_l(k,i)$.

We cannot use all keywords suggested by the algorithm since not all of them
really fit. Instead, we use a minimum significance threshold for judging the
fit of a keyword-document combination. Three possible basic measures
$\mathrm{sig}_l(k,i)$ for the significance of a keyword $k$ with respect to a
given document $i$ and its cluster at level $l$ are conceivable: first, the
absolute frequency

$$\mathrm{sig}_l^{\mathrm{abs}}(k,i) = f_l(k,i), \qquad (4)$$

second, its relative frequency

$$\mathrm{sig}_l^{\mathrm{rel}}(k,i) = \frac{f_l(k,i)}{t_l(i)}, \qquad (5)$$

or an adjusted measure, 

( 6 ) 

The first two have a major drawback concerning the scale dependency. The 
absolute frequency (4) does not take into account the relative importance of the 
result. While the event of three documents out of four having one keyword in 
common is quite significant, three documents out of 30 do not imply a very good 
fit between keyword and document. On the other hand, if only one document 
with assigned keywords is found at all, the relative frequency (5) reaches a max- 
imum while this event could still be pure coincidence without any significance. 
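
The counting step and the first two significance measures can be illustrated with a short sketch. It assumes that a cluster is given as a list of keyword sets, one per document; the function names are hypothetical, and the sample cluster reproduces the counts of the UNIX handbook example above (14 documents, all with index 30, one with 31.9 and another with 10.11). The adjusted measure (6) is not covered.

```python
from collections import Counter


def keyword_counts(cluster_keywords):
    """Return (t, f) for a cluster given as a list of keyword sets.

    t is the number of documents that have any keyword; f maps each
    keyword to the number of documents it has been assigned to.
    """
    indexed = [set(keywords) for keywords in cluster_keywords if keywords]
    f = Counter(k for keywords in indexed for k in keywords)
    return len(indexed), f


def sig_abs(f, k):
    """Absolute frequency, equation (4)."""
    return f[k]


def sig_rel(f, t, k):
    """Relative frequency, equation (5); defined as 0 for an empty cluster."""
    return f[k] / t if t else 0.0


# Cluster of the UNIX handbook example at level 0.5.
cluster = [{"30"}] * 12 + [{"30", "31.9"}, {"30", "10.11"}]
t, f = keyword_counts(cluster)
rel_30 = sig_rel(f, t, "30")
rel_31_9 = sig_rel(f, t, "31.9")

# Scale dependency: three hits out of four versus three out of thirty.
same_abs = (3, 3)
different_rel = (3 / 4, 3 / 30)
```

Here index 30 reaches a relative frequency of 1, while 31.9 and 10.11 each reach 1/14. The last two lines restate the drawback discussed above: the absolute frequency cannot distinguish the two situations, whereas the relative frequency gives 0.75 and 0.1.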

5 Results 

For the evaluation, we first generated a database of random walks based on five 
walks starting from every document with a raw basket (total number of doc- 
uments: 562 295). Empirically, five walks suffice to give the algorithm enough 
stability. However, further investigations will have to aim at theoretically sup- 
porting these results. More than 40% of the documents having usage histories 
have no keywords which is an even higher proportion than in the random sample 
drawn from the OPAC (approximately 30%). This implies that in the context of 
our study, index information currently does not increase the probability that a 
document is actually used. 

The evaluation was carried out in two steps. First, to determine acceptable
parameters for the algorithm (cutoff level for the clusters, choice of a
significance measure and a significance threshold), clusters were generated by
the algorithm for a random sample of 200 documents without keywords (sample A);
the keywords were extracted and manually evaluated by one of the authors.
Second, based on these parameters, we extracted the keyword recommendations for
a random sample of 15 000 documents with keywords (sample B) and compared
the results of the algorithm with the keywords assigned by the librarians.

For sample A, a grid of the cutoff levels for the clusters of 0.5, 0.61, 0.75 and 
1 was constructed (0.61 was the optimal level found in [15]). For these levels, 
we extracted the possible keywords as described in section 4 as well as their 
frequencies. These keywords were then screened by one of the authors - without 
any clue as to the ranking the algorithm gave those words so that a bias could 
be avoided - together with the bibliographic data of the corresponding book and 
had to be classified into the categories “fit” and “no fit”. (While evaluation by 
one of the authors is always problematic, a comparison of the results of sample 
A with sample B shows that the results of both samples are in line.) For the
level of 0.5, the system found a total of 1160 keyword proposals for the 200
documents, of which 458 were found to be appropriate for the document in
question and 702 were not. The results of the other levels were either too
small or of lower quality than those at level 0.5. Consequently, we set the
cutoff level for the second stage to 0.5.

Fig. 2. Keyword suggestion precisions for $\mathrm{sig}_{0.5}^{\mathrm{rel}}$
and $\mathrm{sig}_{0.5}^{\mathrm{adj}}$ (interpolated).

The choice of the significance measure $\mathrm{sig}_l \in
\{\mathrm{sig}_l^{\mathrm{abs}}, \mathrm{sig}_l^{\mathrm{rel}},
\mathrm{sig}_l^{\mathrm{adj}}\}$ proposed in equations (4)-(6) and of the
significance threshold is based on a complete grid search over all levels. With
regard to precision, $\mathrm{sig}_{0.5}^{\mathrm{adj}}$ dominates
$\mathrm{sig}_{0.5}^{\mathrm{rel}}$. Because of its different domain,
$\mathrm{sig}_{0.5}^{\mathrm{abs}}$ is incomparable. However, when we consider
the number of suggested keywords at an acceptable precision level, Fig. 2 shows
no dominant significance measure. $\mathrm{sig}_{0.5}^{\mathrm{abs}}$ is ruled
out and not drawn in Fig. 2 because, except for around 30 (not enough keyword
suggestions) and above 390 keyword suggestions (where it equals
$\mathrm{sig}_{0.5}^{\mathrm{adj}}$), it is inferior in precision. Depending on
the number of keyword suggestions, a choice must be made between
$\mathrm{sig}_{0.5}^{\mathrm{rel}}$ and $\mathrm{sig}_{0.5}^{\mathrm{adj}}$.

We consider a precision of about 80% with at least 170 keyword suggestions as
an acceptable compromise. Table 1 shows the results for the relevant parameters.
Based on this, we chose $\mathrm{sig}_{0.5}^{\mathrm{adj}}$ at a significance
threshold of 0.27 for the evaluation of sample B. For sample A this means that
114 documents out of 200 would have received one or more keyword assignments.



Table 1. Error table for more than 170 suggestions at level 0.5.

measure                              found ok   found not ok   precision   significance threshold
$\mathrm{sig}_{0.5}^{\mathrm{abs}}$  138        74             0.6509      3
$\mathrm{sig}_{0.5}^{\mathrm{rel}}$  137        35             0.7965      0.69
$\mathrm{sig}_{0.5}^{\mathrm{adj}}$  145        36             0.8011      0.27
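
The precision values of Table 1 follow directly from the two count columns, as the following short check illustrates. The helper name `precision` and the dictionary keyed by significance threshold are ours, not part of the original evaluation.

```python
def precision(found_ok, found_not_ok):
    """Share of suggested keywords that were judged to fit."""
    return found_ok / (found_ok + found_not_ok)


# Rows of Table 1, keyed by their significance threshold:
# (found ok, found not ok).
table_1 = {3: (138, 74), 0.69: (137, 35), 0.27: (145, 36)}

results = {
    threshold: (ok + not_ok, round(precision(ok, not_ok), 4))
    for threshold, (ok, not_ok) in table_1.items()
}

# All 1160 unfiltered proposals at level 0.5: 458 fit, 702 do not.
unfiltered = round(precision(458, 702), 4)
```

The rows yield 212, 172 and 181 suggestions with precisions 0.6509, 0.7965 and 0.8011, so only the latter two meet the requirement of roughly 80% precision. Without any threshold, the precision of the 1160 proposals is about 0.39.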

In a second stage, we checked the keyword recommendations for sample B, of
course without using the keywords manually assigned to the documents by
librarians. We used the cutoff level 0.5, the significance threshold 0.27 and
the significance measure $\mathrm{sig}_{0.5}^{\mathrm{adj}}$ as suggested by the
analysis of sample A. The recommendations the algorithm returned were compared
to the indexing information of these documents. We could show that the precision
is slightly lower (77.0%) when only direct hits (a document has the keyword 9.2
and the algorithm proposes 9.2) are counted. In that case, a recall of 66.8%
could be reached. We attribute the slightly higher precision of sample A
(0.8011, as evaluated by one of the authors) to the higher variance of the
small sample and to a variance in the evaluations done by different librarians.
In addition to this automated test, the manual evaluation of a larger sample by
librarians is planned.

6 Conclusion and Outlook 

In this paper, we proposed a (semi-)automated complement for the manual
indexing of large document sets in a library. The advantage of the method is
that, even though it uses data that are more easily obtainable in a digital
library, its basic functioning is independent of the representation of the
documents. Furthermore, it does not require costly analyses of document content,
but uses relatively light-weight usage data, which enables our method to handle
large document sets in the order of some million documents with modest hardware
requirements, for example a standard PC with a 1.5 GHz CPU and 1 GB RAM.

Even for a document collection without keywords, clusters can be generated,
assigned keywords, and used to reduce the amount of manual indexing necessary. 
Furthermore, repeated application of the algorithm - although not studied in 
detail - will help in completing the covering of the whole set of unclassified 
documents, when interleaved with manual acceptance of certain keywords. 

Further research is still needed for some aspects of the underlying cluster- 
ing algorithm that is the key component in the good performance of the overall 
method. For instance, questions of convergence and stability of the results need 
some further considerations. Equally, many possibilities for fine-tuning the al- 
gorithm exist that have not yet been investigated. A more thorough evaluation 
of the quality of the results of this algorithm by reference librarians needs to be 
done. 

As for the aspect of keyword assignment, the use of semantic information in 
the keywords could yield further improvements of the results: If, for example, a 
book has high recommendations for the classifications 4.1 (philosophy, general 
subjects) and 4.4 (metaphysics), these two classifications support each other. 


Abstract. This paper presents the results of our study regarding the different
facets and ways of using annotations in both digital libraries and
collaboratories. This study represents an innovative attempt at gathering 
methodological tools and synergies from both fields in order to effectively 
define a comprehensive model for annotations. Thus we propose a con- 
ceptual model for annotations in order to develop an annotation service 
that can be plugged into digital libraries and collaboratories. Finally, 
starting from our model, we introduce a search strategy for exploiting 
annotations in order to search and retrieve relevant documents for a user 
query. 



1 Introduction 

The research field regarding the design and development of software systems
that are able to provide annotation capabilities on the content that they
manage, e.g. digital libraries and collaboratories, is very active and
productive. On the other hand, the problem of how to incorporate annotations is
usually addressed separately in the fields of digital libraries and
collaboratories, without exploiting the synergies that can be common to both
fields. Our research work represents a first effort to address these issues
together in both fields. This way we can benefit from the methodological tools
coming from both fields in order to define a comprehensive model for annotations
and to design an annotation service that can be seamlessly plugged into
different digital libraries and collaboratories.

The paper is organised as follows: the remainder of this section presents 
digital libraries and collaboratories and the beneficial usage of annotations in 
those fields. Section 2 discusses different angles about annotations, Section 3 
introduces our conceptual model for annotations and some access and retrieval 
strategies that exploit annotations; finally, Section 4 draws some conclusions and 
presents the future work. 

1.1 Digital Libraries and Collaboratories 

Digital libraries are not only the digital versions of traditional libraries, but offer 
means going beyond mere presentation of the content stored in a digital repos- 
itory. Two definitions of digital libraries, coming from two different directions 
and thus focusing on different aspects, point to this fact. The more computer
science oriented view is expressed in the introduction to the first issue of the 
International Journal on Digital Libraries (cited in [8]): 

Digital Libraries are concerned with the creation and management of in- 
formation resources, the movement of information across global networks 
and the effective use of this information by a wide range of users. 

Librarians have a different definition of Digital Libraries: 

Digital Libraries are organisations that provide the resources, including 
the specialised staff, to select, structure, offer intellectual access to,
interpret, distribute, preserve the integrity of, and ensure the persistence over
time of collections of digital works so that they are readily and econom- 
ically available for use by a defined community or set of communities. 
(Digital Library Federation (DLF), 1998, cited in [8]) 

Both these definitions highlight some distinguishing features of digital li- 
braries: firstly the central point of both definitions is that information resources 
should be accessed and used ; then they further couple this concept with the one 
of community of users. In this way a digital library is jointly characterised by its 
collection of information resources and by the community of users for whom the 
collection is managed and made available. Other aspects addressed by the above
definitions are the creation and interpretation of resources. The view shared by
the two definitions, that information resources have to be accessed, coincides
with the last point addressed in the definition of a collaboratory formulated by
William Wulf, who sees such a collaboratory as

...center without walls, in which nation’s researchers can perform their 
research without regard to geographical location - interacting with col- 
leagues, accessing instrumentation, sharing data and computation re- 
source, and accessing information in digital libraries. [14] 

Collaboratories focus on facilitating scientific interaction within a team. Be- 
sides this, they should support the sharing of data and resources. Figure 1 sum- 
marises the aspects of digital libraries and collaboratories. As we will discuss in 
the following, all these aspects are particularly relevant for annotations and they 
can greatly benefit from having annotations available as an additional tool. 

1.2 Annotations Within Digital Libraries 

Annotations can be exploited in order to realise the distinguishing features of 
digital libraries highlighted above. The creation of new information resources 
is supported by annotations in two ways. First, when users add annotations 
to existing information resources, these are new information resources them- 
selves. Second, annotations can also assist in the creation of new information 
resources. Through annotations, new ideas and concepts can be discussed and 
the results of such a discussion can then be integrated into the newly created 
object. Annotations might increase and expand the information resources
managed by the digital library. In this way, they may provide interpretations of
information resources. User communities benefit from such interpretations in 
that they help in understanding the annotated resource and contain additional
information about it. In the Humanities, for instance, interpretation is one of 
the basic tasks scholars perform. Systems like COLLATE or IPSA support this 
task through annotations [1, 7]. Annotations support user communities in access- 
ing the information resources provided by the digital library in a personalised 
and customisable way: indeed users can create annotations that link different 
documents, enabling alternative paths for browsing digital contents and thus 
structuring them in alternative ways, like virtual books [19]. Different layers of 
annotations can coexist in the same document: a private layer of annotations
accessible only by the annotation’s author, a collective layer of annotations,
shared by a team of people, and finally a public layer of annotations, accessible 
to all the users of the digital library; in this way user communities can benefit 
from different views of the information resources managed by the digital library 
[16,15]. Annotations can contain interpretations, reviews and additional infor- 
mation about the resources they belong to. They reflect what others say about 
a resource, which establishes an interesting context exploitable for information 
retrieval [7]. Furthermore the access and retrieval of information resources can 
be aided by means of automatic annotations. Employing topic detection tech- 
niques, a document can be segmented into topics of desired granularity and 
automatic annotations represent a summary of these topics. Then, exploiting 
automatic hypertext construction techniques [3], automatic annotations can be 
linked to the original document. Finally, the content of annotations can support 
the effective use of the digital resources. Automatic annotations, interpretations, 
alternative paths, and all other information contained in annotations help the 
user in approaching a document. 

1.3 Annotations Within Collaboratories 

As we pointed out above, the main characteristics of collaboratories are interac- 
tion, sharing and access. Annotations can be beneficial for all of them. Indeed, 
many systems use annotations to establish collaboration. Wilensky sees annotations
as an example of spontaneous collaboration [23]. Interaction within a
community can be supported by means of shared or public annotations. In COLLATE,
annotations are used to model a scientific discourse between film scientists.
The system supports strong collaboration through nested annotations; users
can directly react to other users’ contributions and do not have to rely on tradi- 
tional means like e-mail or telephone [7]. Annotations are an important way to 
share one’s results with others. Shared or public annotations are visible to more 
persons than the author who created them. Systems supporting data sharing 
through annotations are, among others, those reported in [1,7,9,19]. Sharing 
data triggers at least weak collaboration - users can view others’ results, without 
necessarily directly reacting to them. The access aspect in collaboratories can
be supported by annotations in the same way as discussed in the previous subsection.

2 Annotations 

Since annotations intrinsically entail an active involvement of the users with 
information resources, they naturally bring digital libraries and collaboratories 
closer, so that it is advisable to investigate how to exploit methods and techniques 
coming from both fields in order to effectively employ annotations. Over the past 
years a lot of research work regarding annotations has been done [18, 20], which 
led to different viewpoints about what an annotation is. The following sections 
describe the different angles about annotations that we consider. 

2.1 Annotations as Metadata 

Annotations can be considered additional data about existing content, that is,
annotations are metadata [18]. This reflects a data-specific view on annotations.
From a syntactic point of view one of the main characteristics of metadata is that 
it is connected to the object it refers to; annotations have a similar connection 
to what they are annotating. This way, they are indeed data about data. 

The World Wide Web Consortium (W3C) considers annotations as meta- 
data and interprets them as the first step in creating an infrastructure that 
will handle and associate metadata with content towards the Semantic Web 
[11]; examples are the Annotea Project 1 and the Extensible MultiModal Anno- 
tation (EMMA) 2 markup language. Also systems that employ annotations as 
an extension of bookmarks can fall within this definition. Indeed the additional 
data provided by annotations are exploited to describe, organise, categorise and 
search the bookmarks [13]. As a further example, MPEG-7, named “Multimedia 
Content Description Interface” and developed by the International Organization 
for Standardization (ISO) 3 , is a standard for describing the multimedia content 
data to be processed by a device or a computer code. Finally also automatic an- 
notations can be considered metadata, since they extract summary sentences or 
significant phrases from the document they annotate, thus providing additional 
data for highlighting the key-points of the document. 

1 http://www.w3.org/2001/Annotea/

2 http://www.w3.org/TR/2003/WD-emma-20031218/

3 http://www.iso.ch/iso/en/prods-services/popstds/mpeg.html

2.2 Annotations as Content

Another view on annotations is seeing them as content, reflecting an information 
specific view. Annotations can be regarded as content in two ways: they can be 
content about content and they can be considered as additional content [18]. 
The two are not mutually exclusive: interpretations, for example, are
content about content, but they might also contain additional content. Reviews 
and judgements, as another example, are basically content about content. 

Annotations being additional content augment existing content and allow 
the creation of new relationships among existing contents, by means of links 
that connect annotations together and with existing content. In this sense we 
can consider that existing content and annotations constitute a hypertext, ac- 
cording to the definition of hypertext provided in [5]. For example, [17] considers 
annotations as a natural way of enhancing hypertexts by actively engaging users 
with existing content in a digital library [16].

Normally digital libraries do not have a hypertext connecting documents with 
each other; thus annotations can represent a means for associating a hypertext
to a digital library. In this way it is then possible to exploit the associated 
hypertext in order to enjoy alternative browsing paths and to perform advanced 
document searches, employing hypertext information retrieval techniques [4]. 

2.3 Annotations as Dialogue Acts 

Another viewpoint on annotations, regarding them as dialogue acts, covers a 
communication specific view. This view is concerned with the question of the 
pragmatics conveyed in annotations, i.e. the intention behind a user’s statement. 
Gaining information about pragmatics is an important means to distinguish be- 
tween the different kinds of content we have discussed in the last subsection. We 
may find out about the semantics of utterances in annotations, but this does not 
mean that we can distinguish whether we can see the annotation as content about 
content or an extension of existing content, or even something completely dif- 
ferent. This distinction might be important when applying appropriate retrieval 
functions, as we will see in Section 3.3. 

Each annotation implicitly consists of certain communicative acts, which,
according to Searle, can be classified as (among others) assertives, directives
(e.g., requests), and commissives (e.g., promises) [22]. Communicative acts
allow for communication on both the content and the meta level. On the content
level, assertives connected with a certain discourse structure relation are the
units with which a coherent interpretation of the material can be created [6]. On
the other hand, directives and commissives can trigger further collaborative acts 
on the meta level. Directives can be used to attempt to get some other person 
to do something; an example would be if a user asks the author of a comment 
if he could further elaborate on it. The author, in turn, can answer the request 
with a promise to provide the needed information (and actually provide it later 
on). Certain communicative acts can thus enable strong collaboration, and they 
can be realised as annotations.

3 Comprehensive Model of Annotations

We aim to design and develop a comprehensive model for annotations able to 
address all the previously described facets and to define an appropriate strat- 
egy for exploiting annotations in searching and retrieving documents or other 
annotations. 

3.1 Design Choices 

Considering the complexity of the annotation and the need for a proper conceptual
model of annotation, as explained above, we have decided to model
the annotation using a general-purpose conceptual modelling tool, namely
the Entity-Relationship (ER) model. As introduced in our previous work [2],
in order to capture the complex semantics of the annotation, which emerged 
also from the discussion of Section 2, we can distinguish between the meaning 
and sign of annotations. The meaning of annotation is a main aspect concern- 
ing the concept of annotation, which identifies conceptual differences within the 
semantics of the annotation. For example the different angles about annotations 
introduced in Section 2 can be considered as different meanings of annotation. 
Furthermore, within a given angle, we can identify different meanings of annota- 
tion; for example, within the “annotation as content” viewpoint we can point out 
three different meanings of annotation: comprehension and study, interpretation 
and divulgation, and revision and cooperation. The sign of annotation is a way 
of representing a meaning of annotation. For example we can identify a textual 
or a graphic sign of annotation. These basic signs can be combined together in 
order to create a more complex sign of annotation, capable of expressing complex
meanings of annotation, such as those explained above. Thus an annotation is
expressed by one or more signs of annotation, which in turn are characterised
by one or more meanings of annotation, defining the overall semantics of the 
annotation. 

Before discussing the proposed conceptual schema, our reference architecture 
introduced in [2] has to be borne in mind: we aim to design and develop an 
annotation service that can be easily plugged into different digital libraries or 
collaboratories, allowing these systems to seamlessly extend their functionalities. 
As an important consequence of this architectural choice, we assume that the 
annotation service knows everything about annotations but it has no knowledge 
about documents managed by the system it is plugged into. This is due to the fact 
that the annotation service directly manages annotations while documents and 
information pertaining to them are provided by the system the annotation service 
is plugged into. Thus the annotation service deals with handles to documents,
that allow it to connect annotations to documents, without the need to actually 
manage them.

3.2 Annotation Conceptual Schema

The proposed conceptual schema is shown in Figure 2. It is centred around two 
main issues: how to model annotations and how to connect them to information 
resources. The next sections describe these two issues in detail. 

How to Model Annotations. The Annotation entity represents the ab- 
straction of the annotation, i.e. it expresses the existence of an object capable 
of annotating another object, without further specifying its characteristics. This
is the pivot entity, which provides the basis for modelling annotations. The Annotation
entity owns the following attributes: ID is a unique identifier for the
annotation, e.g. a Uniform Resource Identifier (URI); Created and Modified
represent, respectively, the creation date and the last modified date of the an- 
notation; and Scope specifies if the annotation is private, shared by a team or 
public. 

The discussion carried out in the previous sections showed that the An- 
notation entity alone is not sufficient for covering the semantics of the general 
concept of annotation, so it needs to be partnered with two other entities Mean- 
ing and Sign, representing respectively the meaning of annotation and the sign 
of annotation. The Meaning entity is characterised by a unique identifier, ID, 
and by a Type, which describes the meaning of annotations. On the Meaning 
entity there is a recursive relationship, Contain, that expresses the existence 
of broader meanings and narrower meanings; thus the meanings of annotation 
can be organised in a simple hierarchy and some navigation facilities within this 
hierarchy can be provided to the user. The Contain relationship expresses the fact
that a meaning may be contained in only one other meaning and that it may
contain one or more other meanings. The Sign entity owns a unique identifier,
ID, and a Content attribute, which represents the actual content of the sign of
annotation, e.g. a piece of text. The SignType entity describes the kind of a
sign of annotation, e.g. a textual sign or a graphic sign, and makes it possible to 
correctly interpret the Content attribute of a Sign. The SignType entity is con- 
nected to the Sign entity by means of the Typify relationship, which expresses 
the fact that a Sign must have exactly one SignType, while a SignType may 
specify one or more Sign entities. 

Two relationships, Express and Mean, allow the three entities Annotation,
Meaning and Sign to cooperate in defining the semantics and
the materialisation of an annotation. The Express relationship denotes that
an Annotation entity has to be expressed by at least one Sign entity, and
possibly more, and that a given Sign entity has to be employed in order to
express one and only one Annotation entity. The attributes of Express allow
us to physically identify which part of the information resource has to be
annotated. In particular the Pointer attribute identifies a portion of a digital
object, e.g. it could be an XPath expression in the case of an Extensible Markup
Language (XML) document; the Offset attribute selects a starting offset with
respect to the portion identified by Pointer, e.g. the initial character within 
an XML element; finally the Extent attribute specifies the size of the sign of 
annotation, e.g. the number of characters that are annotated within the portion 
identified by Pointer starting from Offset. 
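
The way Pointer, Offset and Extent narrow down the annotated portion can be illustrated with a minimal Python sketch. It assumes an XML document, treats Pointer as a path in the limited XPath subset supported by the standard library, and counts Offset and Extent in characters of the selected element's text; the function name annotated_span and the sample document are hypothetical and not part of the proposed schema.

```python
import xml.etree.ElementTree as ET


def annotated_span(xml_text: str, pointer: str, offset: int, extent: int) -> str:
    """Return the characters selected by (pointer, offset, extent)."""
    root = ET.fromstring(xml_text)
    # ElementTree evaluates only a subset of XPath, relative to the root element.
    element = root.find(pointer)
    if element is None:
        raise ValueError(f"pointer {pointer!r} selects nothing")
    text = element.text or ""
    if offset < 0 or extent < 0 or offset + extent > len(text):
        raise ValueError("offset/extent fall outside the selected portion")
    return text[offset:offset + extent]


# Hypothetical document: annotate the first word of the paragraph.
doc = "<article><sec><p>Annotations enrich digital libraries.</p></sec></article>"
print(annotated_span(doc, "./sec/p", 0, 11))
```

A full XPath engine would be needed for arbitrary pointers; the sketch only shows how the three attributes compose.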

The Mean relationship expresses the fact that a Sign entity has to be related
to at least one Meaning entity, and possibly more, and that a Meaning entity
may characterise one or more Sign entities.
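
The entities and cardinality constraints described so far can be summarised in a small Python sketch. This is only one possible rendering, under several assumptions: SignType is reduced to a string, the Express attributes are stored directly on the sign, the "exactly one Annotation per Sign" constraint is approximated by containment, and all class and field names are illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Meaning:
    id: str
    type: str
    broader: Optional["Meaning"] = None  # Contain: at most one containing meaning

    def path(self) -> List[str]:
        """Types from the broadest meaning down to this one."""
        node, types = self, []
        while node is not None:
            types.append(node.type)
            node = node.broader
        return list(reversed(types))


@dataclass
class Sign:
    id: str
    sign_type: str           # Typify: exactly one SignType per Sign
    content: str
    meanings: List[Meaning]  # Mean: at least one Meaning
    pointer: str             # Express attributes
    offset: int
    extent: int

    def __post_init__(self) -> None:
        if not self.meanings:
            raise ValueError("a Sign needs at least one Meaning")


@dataclass
class Annotation:
    id: str
    scope: str               # "private", "shared" or "public"
    signs: List[Sign]        # Express: at least one Sign
    created: datetime = field(default_factory=datetime.now)
    modified: datetime = field(default_factory=datetime.now)

    def __post_init__(self) -> None:
        if self.scope not in {"private", "shared", "public"}:
            raise ValueError("unknown scope")
        if not self.signs:
            raise ValueError("an Annotation needs at least one Sign")


content = Meaning("m1", "annotation as content")
interpretation = Meaning("m2", "interpretation and divulgation", broader=content)
sign = Sign("s1", "text", "A key scene.", [interpretation], "./sec/p", 0, 11)
note = Annotation("urn:example:a1", "public", [sign])
print(interpretation.path())
```

Walking the broader links is what would back the navigation facilities within the hierarchy of meanings.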

How to Connect Annotations to Information Resources. As explained 
in the previous section, the Annotation entity represents the abstraction of 
an object capable of annotating another object. In order to connect annotations 
to information resources we need also an entity that represents the abstraction 
of an object that can be annotated; this entity is called DoHandle, which 
represents a digital object by means of a handle to it. Thus the cornerstones
for connecting annotations to information resources are the Annotation and
DoHandle entities, which represent the fact that there are two kinds of related
objects: digital objects that can be annotated and annotations that annotate 
those digital objects. 

The relationship between annotations and annotated digital objects is rep- 
resented by the Annotate relationship, which links an Annotation entity to 
the DoHandle entity it annotates. This relationship expresses the fact that an 
annotation must annotate one and only one digital object and that a digital ob- 
ject may be annotated by one or more annotations. Once we have annotated a 
digital object, the annotation itself can be considered as a digital object eligible 
to be annotated. Thus the conceptual schema has the additional constraint that,
after an annotation has been created, an occurrence of the DoHandle
entity corresponding to the annotation also has to be added, in order to allow the
newly created annotation to be annotated too. Users can therefore create not
only sets of annotations concerning a digital object, but also threads of annota- 
tions - i.e. annotations in reply one to another - which are the basis for actively 
engaging users with the system and for enabling collaboration.

The RelateTo relationship is used to relate annotations to other digital
objects: it associates a sign of an annotation with the digital object it
refers to. This relationship holds between DoHandle and Sign and not between
DoHandle and Annotation, because the Annotation, perceived as an abstraction,
does not have to be related to a digital object, since its main purpose is to
contribute to modelling the fact that there exist annotating objects and annotated
objects. On the contrary, the sign of annotation takes charge of relating to
a digital object and the explanation of this relation is given by the meanings of 
annotation associated with that sign. The RelateTo relationship allows a Sign 
entity to refer or not to a digital object, while a digital object may be referred 
to by one or more signs of annotation. The attributes of RelateTo have the 
same meaning as the attributes of Express. On the whole the Express relationship
specifies the origin of the link and the RelateTo relationship identifies
the destination of the link. 

Finally the User entity represents a user authorised by the system. The Own
relationship relates an annotation with its author; a user may create one or more 
annotations, while an annotation must belong to one and only one user. 
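
The Annotate and Own relationships, together with the constraint that every new annotation receives its own DoHandle, can be sketched in Python as follows. The AnnotationStore class, its method names and the sample identifiers are hypothetical; the sketch assumes handles are opaque strings and keeps everything in memory.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DoHandle:
    handle: str  # opaque handle; the service never inspects the object itself


class AnnotationStore:
    def __init__(self) -> None:
        self.target: Dict[str, DoHandle] = {}  # Annotate: annotation id -> one handle
        self.owner: Dict[str, str] = {}        # Own: annotation id -> one user

    def annotate(self, annotation_id: str, user: str, target: DoHandle) -> DoHandle:
        if annotation_id in self.target:
            raise ValueError("an annotation annotates exactly one digital object")
        self.target[annotation_id] = target
        self.owner[annotation_id] = user
        # The new annotation becomes a digital object that can be annotated in turn.
        return DoHandle(annotation_id)

    def replies(self, target: DoHandle) -> List[str]:
        return [a for a, t in self.target.items() if t == target]

    def thread(self, target: DoHandle, depth: int = 0) -> List[str]:
        """Indented listing of all annotations made, directly or not, on target."""
        lines: List[str] = []
        for a in self.replies(target):
            lines.append("  " * depth + f"{a} ({self.owner[a]})")
            lines.extend(self.thread(DoHandle(a), depth + 1))
        return lines


store = AnnotationStore()
film = DoHandle("doc:film-42")
a1 = store.annotate("a1", "alice", film)
store.annotate("a2", "bob", a1)        # a reply to a1, i.e. a thread
store.annotate("a3", "carol", film)    # a second annotation on the document
print("\n".join(store.thread(film)))
```

Because an annotation has exactly one target, the structure rooted at a document is a tree, which is what makes threads straightforward to traverse.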

The proposed conceptual schema provides great flexibility: we
can express the different aspects of an annotation and couple them together, and
we are not constrained to fixed types of annotations for fixed tasks. So our proposal
represents an enhancement and a generalisation with respect to [11,21]; in fact,
being a conceptual schema, our model can be easily mapped to different models,
such as a relational schema, a Resource Description Framework (RDF) schema or
an XML schema; this way it also provides great flexibility with respect to different
architectural choices.
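
As an illustration of one such mapping, the following Python sketch creates a possible relational schema in an in-memory SQLite database. The table and column names are assumptions made for this example, SignType is flattened into a column, and the RelateTo relationship is omitted for brevity; it is not the mapping used by the authors.

```python
import sqlite3

SCHEMA = """
CREATE TABLE user_account (id TEXT PRIMARY KEY);
CREATE TABLE do_handle (handle TEXT PRIMARY KEY);
CREATE TABLE annotation (
    -- every annotation also has a DoHandle, so that it can be annotated itself
    id       TEXT PRIMARY KEY REFERENCES do_handle(handle),
    created  TEXT NOT NULL,
    modified TEXT NOT NULL,
    scope    TEXT NOT NULL CHECK (scope IN ('private', 'shared', 'public')),
    owner    TEXT NOT NULL REFERENCES user_account(id),   -- Own
    target   TEXT NOT NULL REFERENCES do_handle(handle)   -- Annotate
);
CREATE TABLE meaning (
    id      TEXT PRIMARY KEY,
    type    TEXT NOT NULL,
    broader TEXT REFERENCES meaning(id)                   -- Contain
);
CREATE TABLE sign (
    id           TEXT PRIMARY KEY,
    sign_type    TEXT NOT NULL,                           -- Typify
    content      TEXT NOT NULL,
    annotation   TEXT NOT NULL REFERENCES annotation(id), -- Express
    pointer      TEXT,
    start_offset INTEGER,
    extent       INTEGER
);
CREATE TABLE mean (
    sign    TEXT NOT NULL REFERENCES sign(id),
    meaning TEXT NOT NULL REFERENCES meaning(id),
    PRIMARY KEY (sign, meaning)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript(SCHEMA)
conn.execute("INSERT INTO user_account VALUES ('alice')")
conn.executemany("INSERT INTO do_handle VALUES (?)", [("doc:1",), ("a1",)])
conn.execute(
    "INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?)",
    ("a1", "2004-09-12", "2004-09-12", "public", "alice", "doc:1"),
)
count = conn.execute(
    "SELECT COUNT(*) FROM annotation WHERE target = 'doc:1'"
).fetchone()[0]
print(count)
```

The "at least one" cardinalities (for example, an annotation needing at least one sign) cannot be stated as simple column constraints and would have to be checked by the application or by triggers.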



3.3 Search and Retrieval Issues 

Although annotations are quite a useful and common concept in digital libraries 
and collaboratories, there do not exist many retrieval approaches taking an- 
notations into account. As one of the few examples, Golovchinsky et al. use 
annotations to construct full-text queries out of them [10]. What is missing so 
far are retrieval functions which are potentially able to take the different facets 
of annotations, which constitute a valuable context for document retrieval, into 
account. 

As we have seen in Section 2.2, annotations and the referenced resources 
constitute a hypertext. This makes hypertext information retrieval approaches 
[4] potential candidates to be adapted to annotation-based retrieval. On the 
other hand, annotations might also be content about content such as reviews 
containing judgements about documents and thus cannot be seen as an extension
of the document content. Nevertheless, such annotations contain information
that makes it possible to take relevance criteria other than just topicality into
account. Consider an example of a digital library where students could give 
judgements about documents, like “this book is a very good introduction”, by 
annotating them. Another student might search for books which introduce her to 
the field of digital libraries. In this scenario we can see that the actual information
need can be mapped onto two queries: one made to the set of documents for
topicality (e.g., $q_{doc}$ = “digital libraries”), and one made to the set of annotations
(e.g., $q_{ann}$ = “good introduction”). This results in a composed query
$q = (q_{doc}, q_{ann})$. When seeing retrieval as uncertain inference, the retrieval weight of
a document $d$ w.r.t. the query $q$ is determined by the probability that $d$ implies
$q$. A retrieval function calculating $P(d \rightarrow q)$ considering the annotation context
of $d$ could roughly be outlined as, for example,

\[ P(d \rightarrow q) = \bigl( P_{doc}(d \rightarrow q_{doc}) + P_{ann}(d \rightarrow q_{ann}) - \Omega \bigr) \cdot \bigl( 1 - P_{ann}(d \not\rightarrow q_{ann}) \bigr) \]

$\Omega = P_{doc}(d \rightarrow q_{doc}) \cdot P_{ann}(d \rightarrow q_{ann})$ reflects the possible jointness of events.
The value for $P_{doc}$ is determined by the weight of $d$ w.r.t. $q_{doc}$, whereas the computation
of $P_{ann}$ is based on the annotations made on $d$. Negative annotations
made on the document increase the probability $P_{ann}(d \not\rightarrow q_{ann})$ that the document
does not imply the query and thus decrease $P(d \rightarrow q)$. The calculation of
$P_{ann}$ might need an in-depth analysis of the annotation thread as discussed in
[7] 4. When seeing annotations as additional content rather than judgements, we
do not need a composed query; in this case, it is $q = q_{doc} = q_{ann}$.
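
To make the behaviour of such a function concrete, the following Python sketch reads the outlined retrieval function as the sum of the document and annotation probabilities minus their product, multiplied by one minus the probability of negative evidence. The function name, the parameter names and the sample values are assumptions for illustration only; how the three input probabilities are estimated is left open here, as it is in the text.

```python
def retrieval_weight(p_doc: float, p_ann_pos: float, p_ann_neg: float) -> float:
    """Combine document and annotation evidence into one retrieval weight.

    p_doc     -- probability that d implies q_doc
    p_ann_pos -- probability, from annotations, that d implies q_ann
    p_ann_neg -- probability, from negative annotations, that d does not imply q_ann
    """
    for p in (p_doc, p_ann_pos, p_ann_neg):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    jointness = p_doc * p_ann_pos  # subtracted so joint evidence is not counted twice
    return (p_doc + p_ann_pos - jointness) * (1.0 - p_ann_neg)


# Hypothetical values: a topical document with fairly positive annotations ...
without_negative = retrieval_weight(0.6, 0.5, 0.0)
# ... and the same document once a quarter of the evidence is negative.
with_negative = retrieval_weight(0.6, 0.5, 0.25)
print(without_negative, with_negative)
```

With these sample values the weight drops from roughly 0.8 to roughly 0.6, which matches the statement that negative annotations decrease the retrieval weight.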

4 Conclusions 

In this paper we have discussed the several ways in which digital libraries and 
collaboratories can benefit from annotations. We have shown different viewpoints 
on annotations, which can be seen as metadata, content, and dialogue acts. These 
realise a data, information and communication specific view on annotations. All 
our thoughts led to the presentation of annotation models covering a conceptual 
model and search and retrieval issues. 

The proposed conceptual model for annotations is capable of representing the
different viewpoints concerning annotations and enables the design and development
of advanced retrieval functions. Furthermore it can be easily mapped to different
models, such as a relational schema, an RDF schema or an XML schema,
and it is suitable for developing an annotation service that can be seamlessly 
plugged into different digital libraries and collaboratories. Our considerations 
will be borne in mind for the specification and realisation of an annotation ser- 
vice within the BRICKS project 5 which aims at establishing the organisational 
and technological foundations of a Digital Library at the level of a European 
Digital Memory. 

With respect to annotations supporting access and retrieval in both digital 
libraries and collaboratories we have shown that there do not exist many retrieval 
models for annotation-based document retrieval. Our discussion of this resulted 
in an outline of a potential retrieval function based on the view of retrieval as 
uncertain inference. This function incorporates positive and negative evidence
found both in the document content and in the corresponding annotation thread. By
applying such a retrieval function, relevance criteria other than topicality can be
considered to satisfy users’ information needs. Future research will discuss this
issue more thoroughly and introduce suitable retrieval functions more precisely.

4 The proposed retrieval function can be seen as a generalisation of the one presented in [7].

5 http://www.bricksfactory.org

Abstract. Advanced personalization techniques are required to cope with novel
challenges posed by attribute-rich MPEG-7 based digital libraries. At the heart 
of our deeply personalized news dissemination system P-News is one extensible 
preference model that serves all purposes, preventing impedance mismatches 
between the various stages: User modeling by structured preference patterns, 
automatic query expansion including ontologies, preference query evaluation 
by Preference XPath including nested preferences on categorical data, quality 
assessment of query results, personalized notification and news syndication. 



1 Introduction 

The amount of information available in digital libraries via the Internet is overwhelm- 
ing and the task of extracting all valuable knowledge increasingly time-consuming. 
Especially in areas with short innovation cycles like IT, not only do new documents arrive
in bulk every day, but what is considered to be relevant will also strongly
differ among various user groups like business-oriented consultants, technology- 
oriented developers or highly specialized researchers. Users find themselves con- 
fronted with a well-known dilemma: spending too much time on going through new, 
but probably irrelevant information will cost valuable research or working time, 
whereas spending less time may result in missing some vital information. Many users 
of digital libraries or subscribers of news services have suffered the troublesome 
problem of getting 'properly' notified about the latest publications. This should
definitely happen in as personalized a manner as possible. The acquisition and
maintenance of user preferences about topics of interest and preferred content characteristics
are prerequisites for better solutions than current ad hoc approaches. The P-News 1
project tackles this challenge by applying a highly flexible preference methodology
with powerful query capabilities and managed usage stereotypes throughout the
process of dissemination.

Previous approaches to news dissemination mainly focus on IR techniques
matching a set of (weighted) keywords against the document collection. The
introduction of structured documents in XML and related metadata led to applying
IR techniques to well-defined parts of the document like its title or annotations. However,
recent work in personalized information systems shows that not only the document
is structured, but also the user’s query, sets of keywords, notions of relevance,
preferences for notification, etc. The contribution of this paper therefore is
twofold; first we present an intuitive way for users to express a structure on their 
information needs and preferences accompanying the entire process of dissemination 
way beyond Boolean combinations. Secondly we focus on different roles for interaction
and on providing predefined structures of the information to express common
knowledge and thus ease the usage of the system. Unlike common dissemination
engines, notifications in P-News are supposed not only to involve matching content-based
preferences, but also to adapt closely to particular users and situations. Consider
a sample scenario:

Example: Cathy is a professor at university and besides research projects also man- 
ages a spin-off business. She uses fast Internet access with a PC, but also uses a mo- 
bile phone to keep track of current events. Cathy of course wants to know about news 
related to her research and is interested in specific business news. Assume a suitable 
document arrives at P-News, e.g. a new research article. The dissemination process 
first has to recognize from the representation of Cathy’s topical preferences that the 
new article is relevant, from her quality preferences that its degree of relevance justifies
a notification and from her notification preference, how to syndicate the document
and where to deliver it to. For instance research-related items could always be
sent as emails containing the full document. Her preferences as a business woman
could also consider current business events as interesting enough to trigger notification
to the WAP cell-phone. Due to its limited capabilities, the syndication will automatically
put up only the headline and a short abstract for delivery.
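
The three decisions in this scenario (topical relevance, sufficient quality, form and target of the notification) can be sketched as a small Python pipeline. This is a deliberately simplified illustration and not the P-News preference model: topical preferences are reduced to a keyword set, the degree of relevance to a keyword-overlap ratio, and all class, field and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Role:
    name: str
    topics: Set[str]      # topical preferences (assumption: a plain keyword set)
    min_relevance: float  # quality preference: lowest acceptable degree of relevance
    channel: str          # notification preference: "email" or "wap"


@dataclass
class Document:
    headline: str
    abstract: str
    body: str
    keywords: Set[str]


def notify(doc: Document, role: Role) -> Optional[dict]:
    # 1. Topical preferences: is the document relevant at all?
    overlap = doc.keywords & role.topics
    if not overlap:
        return None
    # 2. Quality preferences: does the degree of relevance justify a notification?
    relevance = len(overlap) / len(role.topics)
    if relevance < role.min_relevance:
        return None
    # 3. Notification preferences: how to syndicate and where to deliver.
    if role.channel == "wap":
        payload = f"{doc.headline}\n{doc.abstract}"            # limited device
    else:
        payload = f"{doc.headline}\n{doc.abstract}\n{doc.body}"  # full document
    return {"role": role.name, "channel": role.channel, "payload": payload}


researcher = Role("research", {"digital libraries", "annotations"}, 0.5, "email")
manager = Role("business", {"spin-off", "funding"}, 0.5, "wap")
article = Document("New annotation model", "A short abstract.", "Full text ...",
                   {"annotations", "xml"})
print(notify(article, researcher))
print(notify(article, manager))
```

Keeping one Role object per role is what allows the same user to hold different, even contradictory, preferences.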

Please note that Cathy can interact with the system in different roles having differ- 
ent (even contradictory) topical preferences, can have different notions of relevance, 
i.e. what degree of quality she is willing to accept within these different roles, and can 
have different preferences on how to be notified in each role. Moreover, in each role 
there will generally be certain stereotypical preference patterns for user groups. To 
cater for such deep personalization scenarios we need more powerful techniques than 
today’s publish/subscribe technologies or IR-based keyword searches in XML docu- 
ments. In this paper we will address all relevant topics towards building such a deeply 
personalized dissemination engine using a single consistent preference model for all 
types of preferences. We will show step by step how to tackle each necessary task 
and present innovative techniques for the application in news dissemination. We will
discuss in detail its impact on specialized digital libraries with a focus on the use of
categorical metadata and attribute-based searches in an intuitive and cooperative
fashion.

The paper is organized as follows: Section 2 will revisit related work and MPEG-7
metadata. In section 3 we present the technological innovations for our news dissemi- 
nation task. Section 4 finally will show how to build these techniques into a running 
system, and will present a use case interaction lifecycle. 



2 News Dissemination 

2.1 Related Work in News Dissemination 

As a first approach to overcome the problem of having to sift through vast amounts of 
information, publishing companies have introduced customizable newsletters. Users
can subscribe to a variety of general terms describing areas of interest, and are periodically
informed which new books might be of specific interest (e.g. ‘Springer-
Alerts’ [1]). However, subscription services still have a long way to go towards personalization,
because users might not be offered a specifically interesting category they
need, might find that the publisher has chosen too broad/narrow terms as categories, 
or may have an entirely different understanding of some categories altogether. 

The area of news dissemination therefore has moved to employing advanced techniques
for keyword matching in the texts of documents. Engines like SIFT [2] already show
good results for full-text retrieval featuring IR techniques and prove that the
task of finding relevant documents for notification can be efficiently performed even
for large numbers of concurrent users. With the advent of XML, engines for
searching XML documents like XIRQL [3] or XXL [4] applied probability-based
keyword retrieval to structured XML documents. However, none of these techniques
focused on retrieval over structured categorical attributes, which is needed for deep personalization,
i.e. not only the document structure should be taken into account, but
also the structure of query terms (keywords, attributes, ...) with respect to each other.

In terms of advanced usability of digital libraries and ease of querying, user profile
modeling has been proposed, see e.g. [5], and has already proved its usefulness through
advanced personalization in the field of news dissemination [6]. User models can be
automatically expanded and profit from already existing similar user profiles. In this
respect also the mining of related information to adapt recommendations more closely
to the individual user has been applied [7]. Our work is a direct continuation of these
advances; however, we smoothly embed these advances in a powerful preference 
framework and thus do not have the overhead of managing user profiles, mined data 
or query terms of different structure. Moreover, we enhance the benefits by using an 
ontology-based approach to incorporate common domain knowledge into the retrieval 
process respecting the role of each individual user. 

In today’s digital libraries compound documents containing text, images, audio or 
even video files together with adequate annotations or comments are quite common. 
Acknowledging its necessity, within standardized multimedia description frameworks 
like MPEG-7 already a simple set of description tools for describing user preferences 
(UserPreferences) [8] has been provided. It enables users to select their preferred
multimedia content in terms of attributes related to the creation, classification, and
source of the content. Multiple preference components can then be organized into a
hierarchical structure, each one carrying a numerical value indicating the relative 
importance of this preference. The expressiveness of UserPreferences descriptions, 
however, is far more limited than our approach, since only exact matching is sup- 
ported, even for simple numerical attributes (e.g. media duration time). 



<CreationInformation>
  <Creation>
    <Creator>
      <Role>
        <Name>Producer</Name>
      </Role>
      <Agent>
        <Name>IBLabs</Name>
      </Agent>
    </Creator>
  </Creation>
</CreationInformation>



(a) 



<CreationInformation>
  <Creation>
    <Creator>
      <Role>
        <Name>Producer</Name>
      </Role>
      <Agent>
        <Name>Microsoft</Name>
      </Agent>
    </Creator>
    <Creator>
      <Role>
        <Name>Editor</Name>
      </Role>
      <Agent>
        <Name>IBLabs</Name>
      </Agent>
    </Creator>
  </Creation>
</CreationInformation>

(b) 



Fig. 1. Example MPEG-7 descriptions 



2.2 MPEG-7 Annotations and Use in Digital Libraries 

The introduction of standardized metadata descriptions facilitates search and retrieval
of multimedia content in digital libraries. Currently MPEG-7 [8] is the most complete
description standard for multimedia data, providing a comprehensive set of
standardized tools to describe multimedia data. For example, a video segment can be
described in many different aspects, like MediaInformation (e.g. storage format, visual
coding), CreationInformation (e.g. title, creator, classification), UsageInformation
(e.g. access rights, distributor), structural aspects (e.g. subsegments) and conceptual
aspects (e.g. text annotation, semantics). The description tools are specified in the
Description Definition Language based on XML Schema. Thus MPEG-7 descriptions
are complex XML documents. For instance, (a) and (b) in Fig. 1 are excerpts from
MPEG-7 descriptions of the creation information for two video segments.

With this standard the focus in searching digital libraries or evaluating a
document’s relevance shifts to attribute-rich search on categorical data. Consider for
example the information in Fig. 1. A query on all documents that have been created
and preferably produced by ‘IBLabs’ will need more than today’s capability of
searching the creator tag in e.g. XQuery for the existence of the keywords ‘IBLabs’
and ‘Producer’. Here the evaluation of a nested preference is needed, where the
keyword ‘Producer’ has to be the role within the same creator tag that also contains the
keyword ‘IBLabs’. Thus document (a) in Fig. 1 would be a better match than
document (b). However, document (b) should nevertheless still be considered more
relevant than other documents because of the keyword ‘IBLabs’ as ‘Editor’, which is
another role of a creator. So the creator attribute has certain categories (‘producer’,
‘author’, ‘editor’) as its domain, on which users might express preferences. Traditional
IR techniques or the MPEG-7 description tools, however, can so far handle such
preferences only to a limited extent, and provide no intuitive (i.e. declarative) way of
expressing them. In the following we will show how our system can deal even with
such complex preferences.



3 Basic Concepts of Deep Personalization for Dissemination 

3.1 A Model for Structured User Preferences in XML Libraries 

MPEG-7 descriptions are basically XML data. In searching or filtering XML data,
traditional IR approaches based on keyword sets or vectors are often unable to refer to
the data structure and to incorporate semantic relations between the query terms. Query
languages for XML, such as XPath or XQuery, can be used to formulate precise
queries over data. But they can only express Boolean (or hard) conditions; no ranking or
soft conditions are possible. Much effort has been spent on combining structure-based
and ranking-based search, e.g. XXL [4] and XIRQL [3]. These approaches mainly use
vague predicates and probability-based combination functions to score structurally
matched document fragments. However, numerical ranking approaches are generally
less expressive than qualitative ones [21].

In [9], an approach for preference modeling is proposed utilizing strict partial
orders featuring an intuitive “I like A better than B” semantics. User preferences are
generally considered soft conditions that are evaluated as strict partial orders over the
data set. Thus all best matching objects, not necessarily exact matches, will be
returned. As an essential feature of the approach in [9], a set of predefined preference
constructors is used to construct arbitrary preferences. For example, AROUND(x) is
a base preference constructor on numerical attribute values preferring values closest
to the stated value x. Pareto and Prioritized are complex constructors for combining
preferences of equal importance or with priorities. This set of constructors is
extensible and users are enabled to define their own constructors if needed. For P-News we
will add a novel preference constructor, called nested preference, extending the model
to handle complex XML data. Formally, let <path> be an XML path expression and
let dom(<path>) denote the domain of objects reachable by <path>. Then a
preference P is defined as P = (<path>, $<_P$), where $<_P$ is a strict partial order over
dom(<path>).
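
The idea of returning all best matches under a strict partial order can be illustrated with a minimal Python sketch. It is not the implementation of [9]; the function names `around` and `best_matches` and the sample durations are assumptions made for this illustration only.

```python
from typing import Callable, List

# worse(x, y) is True when y is strictly preferred to x ("I like y better than x").
Worse = Callable[[float, float], bool]


def around(target: float) -> Worse:
    """AROUND(target): values closer to the target are preferred (hypothetical helper)."""
    def worse(x: float, y: float) -> bool:
        return abs(x - target) > abs(y - target)
    return worse


def best_matches(values: List[float], worse: Worse) -> List[float]:
    """Keep every value for which no strictly better value exists in the input."""
    return [x for x in values if not any(worse(x, y) for y in values)]


# Illustrative data: media durations in minutes, preferred duration around 120.
durations = [95.0, 130.0, 118.0, 122.0]
print(best_matches(durations, around(120.0)))
```

Note that no value equals 120 exactly, yet the two closest values are still returned; values that are equally distant from the target are incomparable under this order, so both survive.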

Definition 3.1 Strict partial order relation between sets

Given P = (<path>, $<_P$) we define a strict partial order $\ll_P$ over the set of all finite
subsets of dom(<path>) as follows. For all finite subsets X, Y of dom(<path>):

$$X \ll_P Y \iff (\forall x \in X,\ \exists y \in Y:\ x <_P y) \wedge Y \neq \emptyset$$

It can be proved that $\ll_P$ is a strict partial order.

Definition 3.2 Nested preference

Given a preference P = (<path>, $<_P$) and objects $O_j, O_k \in$ dom(<path*>), let
$\{O_j.\text{<path>}\}$ and $\{O_k.\text{<path>}\}$ denote the sets of objects selected by navigating
<path> from $O_j$ and $O_k$, respectively. Then a nested preference P* = (<path*>, $<_{P^*}$) is
defined as:

$$\forall O_j, O_k \in \mathrm{dom}(\text{<path*>}):\quad O_j <_{P^*} O_k \iff \{O_j.\text{<path>}\} \ll_P \{O_k.\text{<path>}\}$$

Preference XPath [10] has been developed to evaluate preference queries. It
extends standard XPath by soft filtering conditions bracketed in '#[' and ']#', in contrast
to hard conditions in brackets '[' and ']'. A soft condition defines a strict partial order
over the set of elements to be filtered and returns only the best matches. Extending
the preference model by nested preferences, we extend Preference XPath as follows:

    LocationStep    : axis nodetest (predicate | pref_pred)*

pref_pred is the extension part to the standard XPath, i.e. soft filtering conditions.

    pref_pred       : '#[' preference ']#'

    preference      : base_preference
                    | xpath '{' preference '}'
                    | preference 'and' preference
                    | preference 'prior to' preference

base_preference is defined on atomic attribute values, e.g. strings or numbers. The
second case, xpath '{' preference '}', is the extension for nested preferences. The 'and'
and 'prior to' are for Pareto and Prioritized combinations respectively.

Example (cont.): Assume Cathy prefers videos produced by ‘IBLabs’. Using the
schema of MPEG-7, such a preference can only be expressed in a nested way:

    /Mpeg7/Description/MultimediaContent//*
        #[CreationInformation/Creation/Creator
            {Role/Name is 'Producer' and Agent/Name is 'IBLabs'}]#

Here “Role/Name is 'Producer'” and “Agent/Name is 'IBLabs'” are base preferences
combined by the Pareto constructor 'and', which means both preferences are equally
important. The combined preference induces a strict partial order on objects of
CreationInformation/Creation/Creator, which in turn gives a strict partial order on
higher level elements, i.e. multimedia segments accessed by
/Mpeg7/Description/MultimediaContent//*; thus a nested preference. Evaluating the
query on the documents in Fig. 1, for each Creator in (b), there is a better Creator in (a).
So video segment (a) is considered better than (b) and the intuitively expected result
set {(a)} will be returned.
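
The evaluation of this example can be traced with a small Python sketch. It is an illustration under assumptions, not the Preference XPath engine: creators are reduced to (role, agent) pairs, and the helper names `pos`, `pareto`, `set_worse` and `best_documents` are hypothetical. The Pareto combination follows the usual reading (better in one component, better or equal in the other), and `set_worse` follows Definition 3.1.

```python
from typing import Callable, Dict, List, Set, Tuple

Creator = Tuple[str, str]  # (Role/Name, Agent/Name), a simplification of Fig. 1


def pos(preferred: Set[str]) -> Callable[[str, str], bool]:
    """Base preference: worse(x, y) is True when y is preferred and x is not."""
    def worse(x: str, y: str) -> bool:
        return x not in preferred and y in preferred
    return worse


def pareto(worse1, worse2) -> Callable[[Creator, Creator], bool]:
    """Pareto 'and': y is better in one component and better or equal in the other."""
    def worse(x: Creator, y: Creator) -> bool:
        first = worse1(x[0], y[0]) and (worse2(x[1], y[1]) or x[1] == y[1])
        second = worse2(x[1], y[1]) and (worse1(x[0], y[0]) or x[0] == y[0])
        return first or second
    return worse


def set_worse(worse, xs: List[Creator], ys: List[Creator]) -> bool:
    """Definition 3.1: every x has a better y, and ys is non-empty."""
    return bool(ys) and all(any(worse(x, y) for y in ys) for x in xs)


def best_documents(docs: Dict[str, List[Creator]], worse) -> List[str]:
    """Nested preference: keep documents whose creator set is not dominated."""
    return [name for name, creators in docs.items()
            if not any(set_worse(worse, creators, other)
                       for other_name, other in docs.items() if other_name != name)]


creator_pref = pareto(pos({"Producer"}), pos({"IBLabs"}))
docs = {
    "a": [("Producer", "IBLabs")],
    "b": [("Producer", "Microsoft"), ("Editor", "IBLabs")],
}
print(best_documents(docs, creator_pref))
```

Both creators of (b) are dominated by the single creator of (a), one on the agent and one on the role, which is why only (a) remains.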

It is important to note that keyword search on full-text attributes using standard IR
methods can be orthogonally embedded in our preference model. For instance,
keyword search using the vector space model can be implemented as a basic rank_F
constructor [9] on the full-text attribute, possibly combined with preferences on other
attributes [11]. Thus, P-News caters for arbitrary queries on XML data.

3.2 Structured Preference Patterns of User Groups 

For metadata-based document retrieval in digital libraries, users will always assume a
certain amount of common knowledge within the system. In recent years ontologies
as a way of representing common knowledge or shared vocabulary within a domain
have spread widely [13]. With ontologies we can model complex semantic
relationships and exploit them for subsequent structured querying. As shown in [12],
expanding queries along certain ontology-based patterns will result in improved querying,
because the choice of useful preferences for relaxation can be narrowed down to a
sensible applicable set. While an ontology can be as complicated as arbitrary semantic
graphs including various relationships and inference rules, P-News does not attempt
entirely ontology-driven querying, but instead uses ontologies in their most basic
incarnation: concept hierarchies. Such ontological information imposes a partial order
on the string-valued domain set, e.g. the specified term is preferred to its synonyms
and hyponyms, which in turn are preferred to its hypernyms, while the hypernyms are
still considered better than other values. When a preference query is evaluated, the
partial order induced by the ontology structure is respected. The basic technique is to
use the EXPLICIT [9] or a user-defined preference constructor to expand the original
query.

Since interests often differ between interest groups but are sufficiently similar
within a group, useful default values can be assumed for all preferences not
explicitly provided by a user. We refer to what is typically considered relevant in
different user groups as default preference patterns. These patterns are predefined and
can evolve over the usage cycle of group members through the feedback given. Since
dissemination frameworks have to avoid unnecessary notifications, integrating this
common knowledge into user queries in an unstructured way would often confront
users with lots of useless results. With the preference query model, we can use the
‘prior to’ preference combination to integrate the user group’s preference pattern into
the user’s query, albeit at a lower level of priority. Figure 2 shows the structure of
query terms induced by query expansion with ontologies and preference patterns.



[Figure: the user's preference query is combined 'prior to' the user group's preference pattern.]




Fig. 2. Query expansion with ontologies and preference patterns 



Example (cont.): Consider that our sample user Cathy might research concepts of
object-oriented programming in Java. Our system has a simple ontology as a concept
hierarchy in the IT domain that models the terms ‘Java’, ‘C++’, ‘Smalltalk’ as subtypes
of ‘object-oriented languages’, which in turn is a subtype of ‘programming
languages’. Note that ‘C++’ and ‘Smalltalk’ are not synonyms for ‘Java’, but unlike
systems developers focusing on Java programming, researchers who are interested in
the basic concepts of Java might with high probability also be interested in C++
concepts. Thus, if Cathy has a POS preference on ‘Java’, and a video talk on ‘C++’
arrives, we might consider a notification based on the specific group or pattern Cathy
belongs to. However, by not relaxing above the ‘object-oriented languages’ node in our
ontology, we can always ensure that query results still have enough in
common with Cathy’s query. Moreover, individual preferences always override
preferences of the group. Now assume that a significant number of Cathy’s group
members have stated that they are no longer interested in ‘Smalltalk’. After query expansion with
ontologies and group preferences, Cathy’s original query (represented in the compact
algebra notation of [9] due to the limited space), P = POS(keyword, {‘Java’}), is
expanded into P' = EXPLICIT(keyword, {‘C++’ < ‘object-oriented languages’,
‘Smalltalk’ < ‘object-oriented languages’, ‘object-oriented languages’ < ‘Java’})
prior to NEG(keyword, {‘Smalltalk’}). Hence, we get a new single preference query
that can simply be evaluated like before, but now takes common interests of a user
group into account.
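
How the EXPLICIT part of P' could be derived from the concept hierarchy is sketched below in Python. This is an assumption-laden illustration: the `PARENT` map and the `expand` function are hypothetical, the relaxation ceiling is passed in by hand, and neither the NEG preference of the group nor the 'prior to' combination is modeled.

```python
from typing import Dict, List, Tuple

# Hypothetical concept hierarchy: child term -> parent (hypernym) term.
PARENT: Dict[str, str] = {
    "Java": "object-oriented languages",
    "C++": "object-oriented languages",
    "Smalltalk": "object-oriented languages",
    "object-oriented languages": "programming languages",
}


def expand(term: str, ceiling: str,
           parent: Dict[str, str] = PARENT) -> List[Tuple[str, str]]:
    """Return EXPLICIT-style edges (worse, better), relaxing `term` up to `ceiling`."""
    edges: List[Tuple[str, str]] = []
    node = term
    while node != ceiling and node in parent:
        hypernym = parent[node]
        # The hypernym is less preferred than the more specific term below it.
        edges.append((hypernym, node))
        # Siblings under the same hypernym are less preferred than the hypernym.
        for child, p in parent.items():
            if p == hypernym and child != node:
                edges.append((child, hypernym))
        node = hypernym
    return edges


for worse, better in expand("Java", "object-oriented languages"):
    print(f"'{worse}' < '{better}'")
```

Stopping at the ceiling node is what keeps ‘programming languages’ out of the expanded query, mirroring the restriction not to relax above ‘object-oriented languages’.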



3.3 Selecting Best Matches by Assessing Result Quality 

The knowledge maintained in the structure of preferences can also be used to assess
the quality of retrieved results. To distinguish relevant from non-relevant objects,
declarative query languages offer capabilities that will stop the relaxation at a certain
degree of generalization or will only relax within a certain range of objects, e.g. in the
previous example Cathy’s constraints were only relaxed to languages that are
object-oriented. P-News takes quality assessment beyond mere numerical thresholds to a
relaxation directed by the needs of each individual user, even for categorical data. To
assess the quality of each object returned by a query, the preference structure of the
query is compared with the actual matches of attributes/keywords as shown in [18].
Our model uses different linguistic quality levels to express the relevance of an object,
ranging from ‘sufficient’, ‘acceptable’, ‘good’, ‘very good’ to ‘perfect match’. The
user can individually assign his/her perception of these levels for each base
preference, e.g. in the case of keyword matching using standard IR distance measures or in
the case of categorical data using the tolerated discrepancy for relaxation. Within
complex preferences these basic measures are then aggregated, again according to the
specifications of the user, e.g. median, maximum or minimum (see [22] for details).

Example (cont.): Assume Cathy prefers videos produced by ‘IBLabs’, but would
also prefer files with a size of about 250 MB. She only wants to be notified of
results having ‘very good’ quality or better. For the nested keyword matching,
Cathy can easily define acceptable quality thresholds. For the numerical attribute, file
size, Cathy may state a tolerable deviation of 10%. For each 10% more or less the
quality level drops one step. Assume that P-News has to compare documents (a) and
(b) from Fig. 1 for possible notification with file sizes {(a), 310 MB} and {(b), 240
MB}. In terms of quality on the two base preferences this leads to {(a), perfect, good}
and {(b), very good, very good}. Now Cathy again can express her exact needs in
terms of overall quality. She might e.g. define a maximum of ‘very good’ as sufficient
and get notified about both documents, or she might only be willing to accept
documents with a minimum of ‘very good’ and get notified about document (b) only. In
any case she is enabled to intuitively specify her notion of relevance.
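
The two aggregation choices in this example can be reproduced with a short Python sketch. The quality levels per base preference are taken as given from the example; the `LEVELS` ordering, the label 'perfect' and the `notify` helper are assumptions for illustration, not part of the P-News implementation.

```python
from typing import Callable, Dict, List

# Linguistic quality levels in ascending order.
LEVELS = ["sufficient", "acceptable", "good", "very good", "perfect"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def notify(qualities: List[str],
           aggregate: Callable[[List[int]], int],
           threshold: str) -> bool:
    """Aggregate the per-preference quality levels and compare with the threshold."""
    return aggregate([RANK[q] for q in qualities]) >= RANK[threshold]


# Quality per base preference (nested keyword match, file size), as in the example.
docs: Dict[str, List[str]] = {
    "a": ["perfect", "good"],
    "b": ["very good", "very good"],
}

by_maximum = [d for d, q in docs.items() if notify(q, max, "very good")]
by_minimum = [d for d, q in docs.items() if notify(q, min, "very good")]
print(by_maximum, by_minimum)
```

With the maximum as aggregate both documents pass, since each has at least one base preference at ‘very good’ or better; with the minimum only (b) passes, because the file size of (a) is rated ‘good’.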



4 The P-News System for Personalized Dissemination 

Having presented the underlying technologies to build a running system, let us now
take a closer look at the prototype system’s architecture and the lifecycle of user
interaction.



[Figure components: Visual Preference Modeling Interface; Query Interface; Internet Delivery; Preference Patterns; Personalizer (Composer & Synthesizer); Preference Manager; Preference Repository; Domain Repository; Search Engine (Preference XPath); Test Library (MPEG-7).]



Fig. 3. General architecture of the P-News system 



4.1 P-News MPEG-7 Library and System Architecture 

In P-News, MPEG-7 annotations for multimedia content are stored in XML
repositories, associated with a query engine for evaluating Preference XPath queries [10]. As
in our running example, we set our use case in the IT domain. The test
library consists of about 90 videos (ca. 24 GB) from Computer Chronicles [14] and
the colloquium series of the computer science department at the University of Washington.
All videos have been manually annotated using MovieTool [15]. The annotations
focus on a set of controlled-term attributes provided by the MPEG-7 standard, whose
values are from a predefined vocabulary. This use of controlled vocabulary leads to a
standardized annotation and subsequent querying with categorical data that shows a
strictly typed structure and thus makes user queries comparable within user groups.

Fig. 3 sketches the general architecture of the P-News system. The central
component of our architecture is called the personalizer. It composes the user query by
integrating a user’s preferences. Then it expands the query using the preference pattern of the
user’s group (respecting his/her current role) and poses it to the retrieval engine. The
result is evaluated and all data whose quality assessment allows for notification are
syndicated into the preferred format, layout, etc. and delivered to the appropriate
client device. To adapt his/her stored preferences or enter new preferences, our
architecture also provides a visual preference modeling interface for users to graphically
construct structured preferences. These components are built on top of a preference
manager that manages the different kinds of preferences. For the query composition
and expansion, the user-provided preferences and domain-specific knowledge grouped
into preference patterns are used. Furthermore, each user’s notion of relevance (i.e.
the expected quality of the results) is represented within the patterns, stating
individually how much relaxation is still acceptable. Notification preferences of each user are
specified for content syndication. Please note that, although all these different preferences
are used for specific tasks, their basic structure is always defined by our underlying
preference model. Arbitrarily complex preferences thus can be stored in the same
repository, individually characterized by the user they belong to and the group in
which they are applicable. We store all the preferences in XML format; our
repository [16], however, distinguishes between user preferences, the group-specific
preference patterns, and the ontologies stored in the domain repository. Using Preference
XPath, the preference manager chooses all applicable preferences for the
personalizer.

4.2 The User Interaction Lifecycle 

User Registration and Preference Modeling. Our use case focuses on a digital 
library of IT-related content annotated in MPEG-7. The system maintains a set of 
predefined preference patterns for different user groups. Each new user registering 
with the system is provided a default preference pattern from the user group that 
he/she has been assigned to. The user now can view, edit, construct or remove his/her 
individual preferences using a visual interface, before they are stored in a preference 
repository. 

Query Composition. As discussed in section 3.1, complex preference queries can
include keyword search, attribute matching etc. In MPEG-7, there is a predefined
vocabulary associated with each controlled-term attribute, which can be viewed as a
simple ontology specifying all the valid terms and the subsumption relationships between
terms for the attribute. For each user group there is an ontology representing their
view of the world, i.e. a taxonomy of all the categories, topics and keywords in the IT
domain, which is applied to the full-text attributes. When scheduling user queries,
P-News expands them with the predefined vocabularies or ontologies and
group-specific preference patterns as discussed in section 3.2.

Query Evaluation. According to timing information specified for each query, i.e.
StartTime, Interval and ExpireTime, queries are activated periodically and evaluated
over the set of new data, i.e. the data that came into the system between the last
processing time of the query and the current time. To improve efficiency and
scalability, the common parts of the activated queries, e.g. common data sets or common
XPath expressions, are identified and the computation is shared among them. Existing
work on multiple XML query evaluation has been adapted for this purpose [17].
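
The periodic activation described above can be sketched in Python as follows. This is a simplified illustration under assumptions: the `ScheduledQuery` record, the functions `is_due` and `new_items`, the time units and the sample arrivals are all hypothetical, boundary handling at ExpireTime is a guess, and the sharing of common sub-expressions among queries is not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ScheduledQuery:
    name: str
    start_time: float      # StartTime
    interval: float        # Interval between two activations
    expire_time: float     # ExpireTime
    last_processed: float  # time of the previous evaluation


def is_due(q: ScheduledQuery, now: float) -> bool:
    """A query is activated while valid and once its interval has elapsed."""
    return (q.start_time <= now <= q.expire_time
            and now - q.last_processed >= q.interval)


def new_items(arrivals: List[Tuple[float, str]],
              q: ScheduledQuery, now: float) -> List[str]:
    """Only data that arrived after the last processing time is evaluated."""
    return [doc for ts, doc in arrivals if q.last_processed < ts <= now]


q = ScheduledQuery("cathy", start_time=0, interval=60,
                   expire_time=600, last_processed=60)
arrivals = [(30, "doc1"), (90, "doc2"), (120, "doc3"), (150, "doc4")]

now = 120
batch: List[str] = []
if is_due(q, now):
    batch = new_items(arrivals, q, now)
    q.last_processed = now
print(batch)
```

Recording the processing time after each run is what guarantees that a document is evaluated against a given query only once.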

Quality Assessment. Before making a decision on notifying a user, P-News analyzes
the returned results in terms of quality as discussed in section 3.3, and filters out objects
for which soft constraints have been relaxed too far. The preference structure of the
query is constructed and compared with the respective structure of the matches in the
result set. P-News enables users to specify the acceptable distances for relaxation on
each base preference and computes quality values for complex preferences, again
according to user preferences. Thus an overall quality value for the object in terms of
the query can be computed inductively and all irrelevant objects discarded.

Notification and News Delivery. When P-News decides for notification, the system
sends a message summarizing new media data that might be of interest. This message
is syndicated according to the user-specified form and sent to the preferred device
(see [19] for details on personalized multimedia content delivery). By default, P-News
notifications are simple e-mails listing the interesting documents and containing links
to the full multimedia content. When the user follows a link, the P-News server will
automatically adapt the content to the technical characteristics of the user’s device. Currently
we use SMIL (Synchronized Multimedia Integration Language [20]) files to deliver
video data. SMIL also offers a set of attributes facilitating limited adaptations.

Relevance Feedback. Our current implementation features only a limited approach
to exploiting relevance feedback. For example, P-News assumes a user to be interested
in a particular video (and related keywords/topics) when he/she follows the link to
open a video after reading the respective abstract. This behavior is recorded on the
server and can be used for modifying the user’s preferences as well as his/her group’s
preference pattern. Deriving good patterns from feedback will be part of the ongoing
work on P-News.



5 Summary and Outlook 

In this paper we addressed the problem of deep personalization in news dissemination
systems. The key technology in implementing the P-News system is one coherent
extensible preference model that serves all purposes, preventing an impedance
mismatch between the various stages. Enabling the user to apply individual preferences in
every single step throughout the dissemination process, P-News facilitates a tailored
notification about relevant documents in a digital library. We focused on multimedia
documents described by MPEG-7 metadata to allow users to express their preferred
content, notion of relevance and delivery preferences in an intuitive way. We
presented an extension by nested preferences, which are essential in structured querying of
XML documents, using our Preference XPath. Moreover, in addition to
exploiting the document structure to gain better result sets, we also allowed for expressing
preference structures on users’ preferred keywords or categorical attribute values.
Since users should not be burdened with all the extensive modeling of preferences
within stereotypical interest groups, an ontology-based approach for automatic query
expansion with typical preference patterns has been realized. Finally, we also enabled
users to specify their individual quality preferences to avoid unnecessary
notifications as far as possible. By merging these techniques into the workflow of
dissemination, the P-News system essentially extends the expressiveness of dissemination in
digital libraries, catering for both textual information and categorical metadata
descriptions.

Our future work will concentrate on the experimental evaluation of the system, 
user case studies and the relevance feedback used to reflect current changes of inter- 
est within the user groups. Also managing users changing between different groups 
needs some deeper research. A detailed analysis of each individual user’s interaction 
with the delivered results can be expected to allow for monitoring dynamic changes 
in each group’s profile. Finally, to keep up with standards, we will migrate from 
Preference XPath to Preference XQuery. 

Abstract. In this paper we explore the uniqueness of paper recommendation for
e-learning systems through a human-subject study. Experiment results showed
that the majority of learners have struggled to reach a ‘harmony’ between their
interest and educational goal: they admit that in order to acquire new
knowledge, they are willing to read not-interesting-yet-pedagogically-useful papers.
In other words, learners seem to be more tolerant than users in commercial
recommender systems. Nevertheless, as educators, we should still maintain a
balance of recommending interesting papers and pedagogically helpful ones in
order to retain learners and continuously engage them throughout the learning
process.



1 Introduction 

One of the criteria used in recommendation systems is consumers’ interest in the
item being recommended. For instance, a recommendation system in an e-shop will
recommend a new jazz CD to a customer who likes jazz music; or, in the case of a
digital library, the system recommends a journal article to a scholar whose research
interest matches the topic of the article. In e-business, recommending items according to
the customers’ interest has been proven to be effective for cross-selling, up-selling,
and mass marketing [9]. In e-learning, however, the effectiveness of recommendation
based solely on learners’ interest has not been studied extensively yet. Our previous
survey showed that most learners are willing to read not-interesting-but-useful
papers [10]. The result supports our conjecture that learner interest may not be the only
factor that affects the effectiveness of recommendation in e-learning. We believe that
learners’ knowledge of domain concepts and learning goals might be more important.
Therefore, the system should not recommend highly technical papers to a first-year
undergraduate student or popular-magazine articles to a senior graduate student. To
illustrate the uniqueness of making recommendations for e-learning systems, let us
first look at a motivational example.

A Motivational Example 

Users A and B are to receive recommendations on research articles and news stories,
respectively, as shown in Table 1. User A is expecting an item which is not
only interesting, but also understandable and useful. Even if an article is interesting,
we cannot recommend it if user A is not yet pedagogically ready (without enough
prerequisite knowledge) to consume it. Hence, we should recommend user A some
other article(s) before an interesting but pedagogically unsuitable item is given. For
user B, however, we can simply recommend items of interest to her/him. Therefore,
in order to make recommendations for an educational system, we should
consider a very important feature, i.e. pedagogically-oriented data, which can directly
affect as well as inform the recommendation process, thus enhancing the quality of
recommendations.



Table 1. A comparison of User A and B in e-learning and news story recommendation 





                                      User A                         User B

Items to be recommended               Research articles              News

Item value-ness (the most important   Interesting, understandable,   Interesting
from user’s perspective)              useful

Prerequisite knowledge                Yes                            No

Item presentation order               Yes                            No



Therefore, making recommendations in our system is different
from that in other domains, where users’ interests are the most important factor in
retaining them by way of delivering personalized recommendations. Moreover, depending
on the syllabus, items contained in a recommendation list might not be entirely
interesting to learners, but sometimes all learners must read some of the items regardless
of their interest (e.g. required reading materials). However, if the system continues to
recommend something that cannot stimulate learners’ interest in one way or another,
it may reduce learners’ learning performance, which is also undesirable. Finally, for
e-learning, customization should be made not only of the learning content, but also of
the presentation style [3, 5]. This paper extends our previous work [11, 12], but
mainly focuses on the value-ness of reading items for recommendation in an
e-learning system through a human-subject experiment.

The rest of this paper is arranged as follows. In the next section, we will dis- 
cuss some related work. In Section 3, we will present the technical aspects of our 
approach. Then we will report in detail the experiment we conducted as a way of 
assessing and comparing our proposed recommendation techniques. Finally, we 
conclude this paper by pointing out our future work. 



2 Related Work 

Paper Recommendations 

There are several related works concerning tracking and recommending technical
papers. Basu et al. [1] define the paper recommendation problem as: “Given a
representation of my interests, find me relevant papers.” They study this issue in
the context of assigning conference paper submissions to reviewing committee
members. Bollacker et al. [2] refine CiteSeer, NEC’s digital library for scientific
literature, through an automatic personalized paper-tracking module which
retrieves user interests from well-maintained heterogeneous user profiles. Woodruff
et al. [13] discuss an enhanced digital book with a spreading-activation-geared
mechanism to make customized recommendations for readers with different types
of background and knowledge. McNee et al. [6] investigate the adoption of
collaborative filtering techniques to recommend papers for researchers; however, their
work addresses not how to recommend a research paper, but rather
how to recommend additional references for a target research paper. In the
context of an e-learning system, additional readings cannot be recommended purely
through an analysis of the citation matrix of a target paper. Recker et al. [8] study
the pedagogical characteristics of a web-based resource through Altered Vista,
where teachers and learners can submit and review comments provided by
learners. However, although they emphasize the importance of the pedagogical
features of these educational resources, they did not consider the pedagogical
features in making recommendations.

These works are different from ours in that we not only recommend papers ac- 
cording to learners’ interests, but also pick up those not-so-interesting-yet- 
pedagogically-suitable papers for them. In some cases pedagogically valuable 
papers might not be interesting and papers with significant influence on the re- 
search community might not be pedagogically suitable for learners. 

Document Value-ness 

Most scientific literature retrieval systems, like many other document retrieval 
systems, have focused on finding documents relevant to users’ interests 
(see, for example, [2, 6]). Recently, there have been approaches that augment the 
most commonly adopted similarity-based retrieval. Among them, Paepcke et al. 
[7] propose context-aware content-based filtering, which 
attempts to determine contextual information about a document, 
e.g. the publisher of the document or the time when the document was published. 
For instance, they argue that ‘documents from the New York Times might be 
valued higher than other documents that appear in an unknown publication context’. 
This contextual information provides rich additional information for users and thus 
constitutes a very important aspect of the value-ness of the item. 

Our proposed approach takes into account one type of contextual information: 
the pedagogical feature of learners. In particular, we argue that users’ pedagogical 
goal and interest should be regarded as two of the most critical considerations when 
we are making recommendations in e-learning systems. 



3 Pedagogically-Oriented Paper Recommendation 

Specifically, our goal can be stated as follows: 

Given a collection of papers and a learner’s profile, recommend and deliver a set 
of materials in a pedagogically appropriate sequence, so as to meet both the 
learner’s pedagogical needs and the learner’s interests. 

Ideally, the system will maximize a learner's utility such that the learner gains a 
maximum amount of useful knowledge and is well motivated in the end. Fig. 1 shows 
the flow diagram of the proposed recommendation system in our previous and current 
experiment. The uniqueness of our system is in the incorporation of artificial learners 
in order to solve the cold-start problem in collaborative filtering (CF). The following 
steps are the recommendation processes: 

(1) A tutor manually assigns the properties of each paper (paper model). 

(2) We create artificial learners (group A) with specific learner models. 

(3) The model-based RS recommends a paper to the learners in group A by comparing 
learner model and paper model. 

(4) Learners in group A rate the paper. 

(5) We elicit the learner models of human learners (group B). 

(6) The hybrid-CF module recommends a paper to each learner in group B by comparing 
his/her learner model with those from group A (artificial learners) and 
searching for the paper with the highest rating. 

(7) After learners in group B read the paper, they rate the paper; these ratings are used 
in our analysis. 



[Fig. 1 (flow diagram of the recommendation system): panels labelled “Artificial Learners, Group A” and “Human Learners, Group B”] 




Pedagogical Model-Based Recommendation Technique 

The model-based recommendation is achieved through a careful assessment and comparison 
of both learner and paper characteristics, using two layers of filtering. 
More specifically, each individual learner model is first analyzed in terms of 
not only the learner’s interest but also his/her pedagogical features, such as background 
knowledge in specific topics. Paper models are also analyzed based on topic, 
technical level, material covered and presentation. The recommendation is carried out 
by matching the learner’s interest with the paper’s topics, provided that the technical level of the 
paper does not impede the learner in understanding it. Therefore, the suitability of a 
paper for a learner is calculated as the sum of the fitness of the learner’s interest 
toward the paper and the ease of understanding the paper. 
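
The suitability rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the 0-to-1 scales, the field names, and the particular forms of `interest_fit` and `easiness` are assumptions made for illustration; only the final summation is taken from the text.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Learner:
    interest: Dict[str, float]    # topic -> interest level in [0, 1] (assumed scale)
    knowledge: Dict[str, float]   # topic -> background knowledge in [0, 1]


@dataclass
class Paper:
    topics: Dict[str, float]      # topic -> how strongly the paper covers it
    required: Dict[str, float]    # topic -> knowledge needed to follow the paper


def interest_fit(learner: Learner, paper: Paper) -> float:
    """Coverage-weighted average of the learner's interest in the paper's topics."""
    total = sum(paper.topics.values())
    if total == 0:
        return 0.0
    weighted = sum(w * learner.interest.get(t, 0.0) for t, w in paper.topics.items())
    return weighted / total


def easiness(learner: Learner, paper: Paper) -> float:
    """1.0 when the learner meets every requirement; lower as knowledge gaps grow."""
    if not paper.required:
        return 1.0
    gaps = [max(0.0, need - learner.knowledge.get(t, 0.0))
            for t, need in paper.required.items()]
    return 1.0 - sum(gaps) / len(gaps)


def suitability(learner: Learner, paper: Paper) -> float:
    # The text defines suitability as the sum of the two components.
    return interest_fit(learner, paper) + easiness(learner, paper)
```

Under these assumptions, a learner who is fully interested in a paper's topic but has half of the required background would score 1.0 + 0.5 = 1.5.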

Pedagogical Hybrid Collaborative Filtering Technique 

However, the layered model-based recommendation, which carefully assesses 
learner characteristics and then matches them against papers, is 
very costly for the following reasons: 

— when a new paper is introduced into the system, a detailed identification is 
required, which cannot be done automatically; 

— when a learner gains some new knowledge after reading a paper, a new match- 
ing process is required in order to find the next suitable paper for him/her, re- 
sulting in the updating of his/her learner model; 

— the matching between learner model and paper model may not be a one-to-one 
mapping, which increases the complexity of the computation. 

Alternatively, we can use a collaborative filtering (CF) technique to reduce the 
complexity of the recommendation process. The idea of CF is to pass on the ‘burden’ 
of ‘learning’ the features of a paper to learners by allowing peer learners (nearest 
neighbors) to filter out unsuitable papers. The matching process is therefore performed 
not from learner models to paper models, but from one learner model to other 
learner models, i.e. by comparing the closeness of both learners’ interest and background 
knowledge in order to find the nearest neighbors of a target learner. As a result, 
papers are effectively annotated by fellow learners themselves, and the system does 
not need to modify the features of papers, which greatly improves system performance 
and efficiency. Moreover, this type of social filtering utilizes the network 
value of learners with similar interests (as traditional recommender systems 
do) and similar pedagogical characteristics. 
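
The learner-to-learner matching can be sketched as a nearest-neighbor search. The text does not specify the distance measure, the number of neighbors, or how neighbor ratings are combined, so the Euclidean distance, the parameter `k`, and the mean rating used here are assumptions for illustration only.

```python
import math
from typing import Dict, List, Tuple


def distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two learner-model vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recommend(target: List[float],
              neighbours: List[Tuple[List[float], Dict[str, float]]],
              k: int = 2) -> str:
    """Return the paper id with the highest mean rating among the k nearest learners.

    target: learner-model vector (interest and background knowledge, concatenated).
    neighbours: (model_vector, {paper_id: rating}) pairs, e.g. artificial learners.
    """
    nearest = sorted(neighbours, key=lambda n: distance(target, n[0]))[:k]
    ratings_by_paper: Dict[str, List[float]] = {}
    for _, ratings in nearest:
        for paper_id, rating in ratings.items():
            ratings_by_paper.setdefault(paper_id, []).append(rating)
    # Iterate in sorted order so that ties resolve to the smallest paper id.
    return max(sorted(ratings_by_paper),
               key=lambda p: sum(ratings_by_paper[p]) / len(ratings_by_paper[p]))
```

Note that no paper features are consulted here: the only inputs are learner models and the ratings given by neighboring learners.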



4 Experimental Results and Discussions 

In our experiments, we intend to answer some specific questions: when recom- 
mending articles to learners, what are the most important criteria for recommen- 
dation? Are a student’s learning interests more important than other aspects? Or 
their learning goal? Should we also recommend something which will aid their 
learning based on their pedagogical characteristics? 



4.1 Experiment Background and Setup 

The human subject study was conducted at a university in Hong Kong. The course is a 
senior-level undergraduate course in software engineering, of which the first author is 
the instructor. There are altogether 48 students, and there are 23 candidate 
papers in English related to software engineering and internet computing. These 
23 papers were chosen from a pool of more than 40 papers originally selected for 
the course, and form part of the required reading materials for the 
group project in this course. The length of the papers varies from 2 pages to 15 pages. 
Most of them are popular articles with a low technical level that are suitable 
for these students. 

For the purpose of testing, we first generated 50000 artificial learners (ArLs) 
(Group A in Fig. 1). Each ArL then rated the 23 papers according to its individual 
learner model (pure model-based). The rating mechanism was the same as the one we 
reported in [12]. After that, we used human subjects as the target learners (Group B in 
Fig. 1). Then, two cold-start recommendation techniques were applied for these target 
learners. The first technique used a hybrid-CF approach (as shown in Fig. 1), while 
the second used random assignment as the control [11]. 



[Figure 2: survey form with two sections, “Learner Interest” and “Learner Background Knowledge”. Items, each rated on a scale from 5 to 1: Software development; Network security; Web design and application; Statistics; User interface design; Algorithm complexity analysis; Recommender system; Discrete Mathematics; Marketing and management; Project management. Button: “Generate ArtLearner”.] 

Fig. 2. Survey questions for target learner 



We first distributed 48 surveys asking about students’ interests and their knowledge 
background (Fig. 2). The background knowledge consists of knowledge items 
students have learned in other courses, items which are also needed to understand the 
papers. After we received 41 responses, we used the ratings by artificial learners and 
hybrid-CF to find the most suitable paper for each respondent. In addition, we 
selected another paper randomly. Thus, each student was assigned two papers to read 
within five days and was required to return a pair of feedback forms, one for each paper 
(Fig. 3). The feedback form collected their subjective evaluation after reading 
the papers. We also asked them to write some critical comments about 
the papers, which served as an indicator of how seriously they had read the papers. 
They were informed that adequately filling in their feedback forms would give them a 
bonus mark. None of the learners knew that one of the papers was selected randomly, 
but they did know that they would receive personalized articles which could be used in 
their group projects (for this course). 



4.2 Experiment Results and Discussion 

In all, 24 pairs of valid feedback forms were received on time. 6 pairs of feedback 
forms were received more than a week later, and 3 feedback forms were not valid, e.g. 
containing multiple answers, blanks, etc. 8 students did not return their feedback 
forms. In this paper, we only show the results of the 24 pairs of valid feedback forms. Fig. 4 
shows part of these results. The vertical axis of each diagram denotes the number of 
subjects who chose the respective answer. The horizontal axis denotes the answer options 
on a 4-point scale, e.g. 4 for “very”, 3 for “relatively”, 
2 for “not really”, and 1 for “not at all”, as shown in Fig. 3. The left bar 
shows the result for the recommended paper, while the right bar is for the randomly 
assigned paper. 



Q1. Does the paper match your interest? 
    4. very much   3. relatively   2. not really   1. not at all 

Q2. Is the paper difficult to understand? 
    4. very difficult   3. difficult   2. easy   1. very easy 

Q3. Is the paper useful to your project? 
    4. very much   3. relatively   2. not really   1. not at all 

Q4. Is the paper useful to aid your understanding of the SE concepts and techniques 
    learned in the class? 
    4. very much   3. relatively   2. not really   1. not at all 

Q5. Would you recommend the paper to your fellow group members or other fellow 
    classmates? 
    4. absolutely   3. relatively   2. not really   1. not at all 

Q6. Do you learn something “new” after reading this paper? 
    4. absolutely   3. relatively   2. not really   1. not at all 

Q7. What is your overall rating towards this paper? 
    4. very good   3. good   2. relatively   1. bad 

Please give several sentences of critical comments on the paper. 

Fig. 3. Learner Feedback Form 



Fig. 4(a) shows that learners were more interested in the recommended paper (left 
bars) than a randomly selected paper (right bars). For example, 6 subjects felt that the 
recommended papers are very interesting, while only 1 subject felt that the random 
paper was very interesting. Fig. 4(b) shows that learners felt that recommended papers 
were easier to understand than a randomly assigned one. This result conforms to our 
prediction, since the rating mechanism used by ArLs incorporates both learner interest 
and knowledge background in understanding a paper. Thus, a recommended paper 
generated by ArLs will also fit human interest and knowledge backgrounds. 

Fig. 4(c) shows the answers to the question of whether the papers are useful to their 
class project. The result shows that most subjects found the 
recommended papers more useful than the random papers. Fig. 4(d) shows that 
recommended papers were more likely to be recommended by learners to others. 
However, when we asked them whether they learned something new after reading 
the paper, it is not clear whether a recommended paper really gave them 
more new knowledge than a randomly assigned paper did (see Fig. 4(e)). 
The result is not surprising, since all the candidate papers were carefully selected for this 
course and their added value for learners is high (most subjects felt that they learned 
a lot after reading those papers). Finally, Fig. 4(f) shows that recommended papers 
received higher overall ratings. 

[Figure 4: six bar-chart panels, (a) to (f), with the answer options 4 to 1 on the horizontal axis; one panel is titled “Overall Rating”] 

Fig. 4. Results of learner feedback 

From Fig. 4, it is ambiguous whether the overall ratings depend on learner interest 
or on the usefulness of the paper. It is also not clear how individual differences (each 
subject’s tendency in giving ratings) affect the results, because the data only show the total number of 
subjects for each option. To overcome this problem, we analyze the ratings given by 
each subject to both the recommended and the randomly assigned paper. Specifically, we 
assign a numerical value to represent the difference between the ratings of the two papers. 
For instance, if a subject rates the recommended paper as “very interesting” (4) 
and the randomly assigned one as “relatively interesting” (3), then the difference is +1 
(4 - 3 = 1). However, if the subject rates the recommended paper as “not really useful” (2) 
and the randomly assigned one as “very useful” (4), then the difference is -2 
(2 - 4 = -2). Fig. 5 shows the distribution of these numerical differences. Except for 
the category “difficult to understand”, a positive value on the horizontal axis means 
that the recommended paper outperforms the randomly assigned one. Compared 
to Fig. 4, two results are worth discussing here. First, in Fig. 5, the pattern 
of “overall rating” is similar to those of “useful to project” and “recommend to others”, 
while in Fig. 4, the pattern of “overall rating” is closer to those of “interesting 
article” and “useful to project”. Second, in Fig. 4, it is difficult to tell whether learners 
gain more new knowledge from the recommended paper. But from Fig. 5, it appears 
that, on average, learners gain the same amount of new knowledge from both 
papers. 

Fig. 5. The frequency of responses in terms of the differences between ratings given to 
recommended and randomly assigned papers 

Table 2. The correlation matrix of the differences between recommended and randomly 
assigned papers (for the 7 questions in Fig. 3) 

        Q1       Q2       Q3       Q4       Q5       Q6       Q7 
Q1      1 
Q2     -0.578    1 
Q3      0.535   -0.312    1 
Q4      0.475   -0.343    0.552    1 
Q5      0.635   -0.305    0.653    0.658    1 
Q6      0.373   -0.127    0.189    0.205    0.486    1 
Q7      0.616   -0.332    0.765    0.517    0.696    0.346    1 
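
The per-subject differences plotted in Fig. 5 can be computed as in the following sketch. The function name and the four example subjects are hypothetical and are not the study's data; the first two subjects simply reproduce the worked examples in the text (4 - 3 = +1 and 2 - 4 = -2).

```python
from collections import Counter
from typing import List


def rating_differences(recommended: List[int], random_assigned: List[int]) -> List[int]:
    """Per-subject difference: rating of recommended paper minus rating of random paper."""
    if len(recommended) != len(random_assigned):
        raise ValueError("each subject must rate both papers")
    return [r - q for r, q in zip(recommended, random_assigned)]


# Hypothetical answers to one question on the 4-point scale, one entry per subject.
recommended = [4, 2, 3, 3]
random_assigned = [3, 4, 3, 1]

diffs = rating_differences(recommended, random_assigned)   # [1, -2, 0, 2]
histogram = Counter(diffs)   # frequency of each difference value, as in Fig. 5
```

Working with differences rather than raw counts removes each subject's general tendency to rate high or low, since both ratings come from the same person.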



Moreover, in order to find which factors affect the overall ratings, we calculate the 
correlation matrix of those individual differences (Table 2). From Table 2, it is clear 
that “overall rating” (Q7) is most positively correlated with “useful for project” (Q3), 
with a correlation coefficient of 0.765. This suggests that learners weighted their 
learning goal more than their interest (Q1, correlation = 0.616). In other words, they 
appreciated and were willing to read not-interesting-yet-pedagogically-useful papers, 
which validates our conjecture. Since the group project was assigned according to 
student interest (they selected it from available topics), a positive correlation 
between student interest (Q1) and usefulness of paper (Q3) is to be expected; it equals 
0.535. 
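
A correlation matrix of this kind can be computed from the per-question difference columns as sketched below. The paper does not state which correlation coefficient was used, so Pearson's coefficient is an assumption here, and the three columns are hypothetical values rather than the study's data.

```python
import math
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long samples (both must have non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """Pairwise correlations between the per-question difference columns."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


# Hypothetical rating differences for four subjects on three questions.
differences = {
    "Q1": [1, -2, 0, 2],
    "Q3": [1, -1, 0, 2],
    "Q7": [2, -4, 0, 4],
}
matrix = correlation_matrix(differences)
```

Reading the row for Q7 in such a matrix shows which of the other questions moves most closely with the overall rating.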

In addition, we also studied the effect of ‘peer-to-peer’ recommendations made by 
each learner after s/he read the papers. Overall, the result is quite encouraging, show- 
ing that learners will tend to recommend those highly rated papers specially tailored 
for them to other similar learners (Q5, correlation coefficient = 0.696). 



4.3 Lessons Learned 

When we analyzed the comments given by learners after they read the two 
papers, we found that the majority of learners struggled to reach a 
‘harmony’ between their interest and educational goal: 

• They are still willing to accept interesting, yet pedagogically difficult paper 

• They are also willing to accept un-interesting, yet pedagogically helpful paper 

They admit that in order to acquire new knowledge for either their group project or 
long-term goal, they will tolerate the system recommending those ‘unsuitable’ papers 
to them. For example, one of the students wrote in the feedback form: 
‘..interesting article on..., but I find this article is quite difficult to me, too many technical 
terms’. Another wrote: ‘to be frank, the article is not likely to be what I am 
interested in, but, it contains some useful knowledge...’. A similar comment was 
made by another student: ‘..it is difficult for me to understand, yet the topic 
that the paper discussed is very interesting.’ 

Another interesting observation is that one critical feature of a paper lies in its 
‘interestingness’. This is especially true in the case of higher education in Hong 
Kong, since most of the students are ‘application-oriented’. Hence, one big challenge 
for the RS is to retain learners and continuously engage them in the track of learning 
through the recommendation of ‘interesting’ papers. 

As specified in [4], the bottom-line measure of recommender system success is 
user satisfaction. One possible way of measuring user satisfaction is not just to collect 
information on how much users like the recommended items, but also to 
ask them whether or not they would recommend the items received to fellow users 
(Fig. 3). 

In general, in line with our hypothesis about the uniqueness of paper recommendation 
for educational purposes, users’ pedagogical goal and interest remain 
two of the most important features. How to balance these two features 
will be one of the biggest challenges for research in the area. 



5 Conclusion 

In this paper, we explore the uniqueness of paper recommendation for e-learning 
systems through a human-subject study. Experiment results showed that the majority 
of learners have struggled to reach a ‘harmony’ between their interest and educational 
goal: they are willing to read non-interesting-yet-pedagogically-useful papers in order 
to acquire new knowledge for either their group project or their long-term goal. 
Hence, from this perspective, learners seem to be more tolerant than users in commer- 
cial RSs. Nevertheless, as educators, we should still maintain a balance of recom- 
mending interesting papers and pedagogically helpful ones in order to retain learners 
and continuously engage them throughout the learning process. This is especially true 
in our case, since most of the students in Hong Kong are more ‘application-oriented’. 

In the future, we plan to conduct larger-scale experiments to explore the issue 
of ‘cross-recommendation’, where users will receive recommendations from both artificial 
and human learners. 

Abstract. MusicAustralia is a Web portal for anyone interested in Australian 
music. A joint development of the National Library of Australia and Screen- 
Sound Australia: National Screen and Sound Archive, it provides users with ac- 
cess to a federated resource discovery service for Australian music in notated 
and audio representations and in digital and non-digital formats, as well as a di- 
rectory service providing information on people, organisations and services as- 
sociated with Australian music. This paper outlines the architecture of the Mu- 
sicAustralia service, focusing particularly on its federated service model and the 
infrastructure elements and business processes developed to support this archi- 
tecture. It also looks at the way in which another major component of the feder- 
ated digital library of Australian music - the Peter Burgis Performing Arts Ar- 
chive - is using the MusicAustralia service model and architecture to shape its 
own strategies and structures. 



1 Introduction 

Ensuring access to music resources in the digital environment is a challenging task. 
Rapid developments in digital music technology, Web delivery of sound, score, image 
and multimedia, rights management and interactive technology systems are all chang- 
ing the face of music. This means that the ‘documents’ of musical culture are becom- 
ing more complex to capture, preserve, manage and deliver and necessitates stronger 
linkages between the musical world and the collecting institutions. 

The National Library of Australia is responsible for ensuring that documentary re- 
sources of national significance relating to Australia and the Australian people are 
collected, preserved and made accessible. Music resources have always been a major 
component of this mission. But in such a rapidly changing environment, how can 
long-term sustainable policy and planning for music resources best be developed at a 
national level? 

In a federated national, state and local government system, and with music docu- 
mented in different and complex formats, no single organisation can address these 
issues independently. Indeed, even the national music collection is primarily shared 
across two institutions. The National Library collects and preserves printed and 
manuscript music and personal papers, pictorial and ephemeral documentation, and 
oral history and archival sound recordings, mainly of folklore and vernacular music. 
ScreenSound Australia: the National Screen and Sound Archive collects and preserves 
all commercially recorded music and its associated documentation. 

The National Library and ScreenSound Australia identified a strong case for de- 
veloping a national service for music which applied new technologies in an integrated 
and holistic way and embodied a national digital strategy for music. Such a strategy 
needed to address convergence of technologies and methodologies as well as interop- 
erability. The service would have to be cost-efficient, in order to encourage participa- 
tion and to generate content. It also needed to be built on collaboration and partner- 
ships across cultural institutions, the research and education sectors, and the music 
industry and communities. 

This vision of a national music service has now become reality in the form of 
MusicAustralia. [1] MusicAustralia is a new federated service developed jointly by 
the National Library and ScreenSound Australia, together with a range of content 
partners. Through a single Web interface, users can find comprehensive information 
about Australian contemporary and heritage music and music-related resources, in- 
cluding printed scores, sound recordings, manuscripts, texts, images, moving images, 
and Web sites. Users can also have access to digital musical materials in a variety of 
formats and from a range of institutions and sectors, though the Australian digital 
music collection is in its infancy and only a portion of the resources discoverable 
through MusicAustralia are available online. MusicAustralia also enables users to 
locate information about people, organisations, collections, activities and services. 



2 Federated Service Model 

The service model for MusicAustralia builds, in part, on the National Library’s previous 
experience and success in developing PictureAustralia. [2,3] This federated service 
supports access to multiple collections of digitised images - more than 1,000,000 
in all. In PictureAustralia users ‘discover’ digitised images from 30 different institutions 
through a single Web interface, and then navigate via image thumbnails back to 
the home institution’s server, where viewing copies and further information are lo- 
cated. Aggregation of metadata for PictureAustralia is highly automated. The Na- 
tional Library uses the Open Archives Initiative Protocol for Metadata Harvesting 
(OAI-PMH) [4] to harvest descriptive metadata from contributing institutions, either 
from Web pages generated by local collection management systems or via a local 
OAI repository. Each institution must map and convert its own metadata schema to 
the simple Dublin Core [5] schema deployed in PictureAustralia. This architecture 
suits the Australian pictorial ‘scene’, in which a variety of descriptive record formats 
are deployed to describe image collections. 

The same basic approach has been applied in the design of MusicAustralia. De- 
scriptive metadata are harvested and aggregated at a national level, with links back to 
digital objects stored on the servers of contributing institutions. But a significantly 
greater level of complexity is imposed by the nature of music resources, both in terms 
of their description and in terms of their digital representations. Cooperative delivery, 
at a national level, of digital music objects and their associated metadata is a signifi- 
cant challenge. 

The MusicAustralia model builds on and aggregates existing processes, adapted to 
a Web environment. The key aim is to maximise the quantity of bibliographic coverage 
while supporting the creation and delivery of digital music objects. These objects 
- digitised and ‘born digital’ - are retrieved from multiple and disparate databases, 
which the user is able to interrogate and navigate. The service model aims to facilitate 
participation by smaller, specialist organisations and independent artists, as well as by 
large national institutions. 



3 Service Architecture - Resource Discovery 

MusicAustralia has adopted a service architecture for resource discovery based on 
Australia’s National Bibliographic Database (NBD). [6] The NBD is a national union 
catalogue with more than 14 million bibliographic records and 36 million holdings 
records. In 2003, the NBD held records in MARC format [7] for more than 55,000 
Australian printed music items, almost 30,000 Australian music sound recordings, and 
several hundred Australian manuscript music items. 

Nevertheless, significant portions of the Australian music corpus, especially com- 
mercially recorded music, were not represented in the NBD. These included the sub- 
stantial holdings of ScreenSound Australia, the Australian Music Centre, the National 
Archives of Australia and the Australian Broadcasting Corporation (ABC). Many of 
these organisations create rich and detailed records to describe their music holdings. 
ScreenSound Australia’s MAVIS [8] system, for instance, supports inclusion of both 
descriptive and collection management data. 

The business model of the NBD has been changing in recent years, in response to a 
growing need to harvest metadata in formats other than MARC and to integrate them 
into the NBD. Placing the NBD at the centre of resource metadata aggregation for 
MusicAustralia has benefited both the NBD and MusicAustralia. The coverage of 
music resources in the NBD has grown dramatically with the addition of records des- 
tined for MusicAustralia, and MusicAustralia has been able to import the large num- 
ber of existing records from the NBD. 

In addition to resource metadata, MusicAustralia also aggregates metadata about 
people and organisations (‘party metadata’) for single point access. The initial content 
for the party data within MusicAustralia was derived from the NBD’s Name Author- 
ity File, but this is being augmented with richer information from other sources. Insti- 
tutions contribute party metadata as a separate process. Party metadata are stored in a 
local schema which has been developed in the international context of MAPS and 
EAC [9], and is seen as the first stage of a generic Australian party schema. 



4 Business Processes for Resource Metadata 

In 2003 the National Library implemented a Harvester service as a generic framework 
for updating repositories with data gathered from various sources. This includes sup- 
port for the acquisition of ‘batch’ data as well as support for online data contribution. 
The Harvester batch system uses OAI or FTP to gather non-MARC data in standard 
formats and convert it for input to the NBD or another ‘downstream’ repository. The 
combination of the Harvester and the NBD in the MusicAustralia service model 
means that contributors can provide metadata in two different ways: as MARC records 
directly to the NBD, or as non-MARC records via the Harvester. 

Current MARC record contributors to the NBD can continue to add their music records 
to the NBD by using the Kinetica cataloguing client or by using Kinetica 
Batch*Link from their local cataloguing system. [10] In both cases, MARC records 
are extracted from the NBD for the MusicAustralia database. Contributors of MARC 
records do not need to establish additional methods of presenting their data for inclu- 
sion in the MusicAustralia service. 




[Figure 1: architecture diagram with two data flows, “Resources” and “Party data”] 

Fig. 1. MusicAustralia: Service Architecture and Business Processes for Resource Discovery 



4.1 Contributing Non-MARC Resource Records and Party Records 

Contributors of non-MARC records must present records extracted from their collec- 
tion management system to the Harvester, which performs the following processing 
stages. 

Harvest the Data. In many cases, contributors are establishing an organisational OAI 
repository to present their records for OAI harvesting. In some cases, contributors are 
choosing to FTP their records to the Harvester instead. The great advantage of the 
OAI method is that only new and/or updated records are presented, which signifi- 
cantly reduces processing loads. 
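
The incremental behaviour of OAI harvesting comes from the protocol's selective harvesting arguments: a `ListRecords` request with a `from` date asks a repository only for records created, changed or deleted since that date. The sketch below builds such a request and reads the record headers from a response. The base URL, identifier and dates are hypothetical, and nothing here is taken from the Harvester's actual code; `oai_dc` is used because every OAI-PMH repository must support it, whereas the prefix under which a repository exposes MODS is repository-specific.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url: str, metadata_prefix: str,
                     from_date: Optional[str] = None) -> str:
    """Build a ListRecords request; `from` restricts it to records changed since a date."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


def record_headers(response_xml: str) -> List[Tuple[Optional[str], Optional[str]]]:
    """Return (identifier, datestamp) for every record header in a ListRecords response."""
    root = ET.fromstring(response_xml)
    headers = []
    for header in root.iterfind(".//oai:record/oai:header", OAI_NS):
        headers.append((
            header.findtext("oai:identifier", namespaces=OAI_NS),
            header.findtext("oai:datestamp", namespaces=OAI_NS),
        ))
    return headers


# Hypothetical repository response, trimmed to the parts used above.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:music/1</identifier>
        <datestamp>2004-03-02</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://example.org/oai", "oai_dc", from_date="2004-03-01")
headers = record_headers(SAMPLE_RESPONSE)
```

A harvester would typically remember the date of its last successful run and pass it as `from_date` on the next one; large result sets are continued with the protocol's resumption tokens, which this sketch omits.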

Format and Analyse the Data. In most cases, data are presented in the Metadata 
Object Description Schema (MODS), which the National Library has adopted as its 
exchange XML schema for Harvester purposes. MODS is richer than Dublin Core 
and more adaptable than MARC, and was developed with digital objects firmly in 
mind. The Library of Congress offers a number of stylesheet tools for converting data 
between Dublin Core, MODS and MARC. [11] 

MODS meets all identified MusicAustralia resource metadata needs, and offers 
significant advantages over Dublin Core, especially in its capacity to represent ‘constituent 
parts’. In some cases, however, data are harvested in Dublin Core or institution-specific 
formats. The Harvester must therefore recognise and invoke any contribution-based 
rules, including rules specifying automated conversion from one of these 
formats to MODS. 

Submit the Data for Further Processing. The Harvester separates new, updated and 
deleted resource records into different processing streams, converts them from MODS 
to MARC, and submits them to the NBD. This involves three steps: 

1. Using an XSLT [12] stylesheet (provided by the Library of Congress but amended 
to meet local conditions) to convert the MODS data to the MARCXML [13] 
schema; 

2. Using a MARCXML to MARC converter (provided by the Library of Congress but 
amended to overcome some difficulties with character encoding schemes) to con- 
vert the MARCXML data to the MARC binary format; and 

3. Passing the MARC data via FTP to the Kinetica Batch*Link service, which under- 
takes additional match and merge and data processing actions before loading the 
data into the NBD. 
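
The stream separation and the first two conversion steps can be sketched as follows. This is an illustration under stated assumptions, not the Harvester's implementation: the record fields `status` and `mods` are hypothetical, the two converters are stand-ins for the Library of Congress stylesheet and MARCXML-to-MARC converter, and the FTP transfer of step 3 is omitted.

```python
from collections import defaultdict


def split_streams(records):
    """Group harvested records into processing streams keyed by status."""
    streams = defaultdict(list)
    for record in records:
        streams[record["status"]].append(record)
    return dict(streams)


def to_marc(mods_xml, mods_to_marcxml, marcxml_to_marc):
    """Steps 1 and 2: MODS -> MARCXML -> MARC, using caller-supplied converters."""
    return marcxml_to_marc(mods_to_marcxml(mods_xml))


# Hypothetical harvested records; real ones would carry full MODS documents.
harvested = [
    {"id": "a", "status": "new", "mods": "<mods>A</mods>"},
    {"id": "b", "status": "deleted", "mods": "<mods>B</mods>"},
    {"id": "c", "status": "new", "mods": "<mods>C</mods>"},
]
streams = split_streams(harvested)

# Stand-in converters that only tag their input, so the ordering is visible.
fake_xslt = lambda xml: "MARCXML[" + xml + "]"
fake_binary = lambda xml: "MARC[" + xml + "]"
converted = [to_marc(r["mods"], fake_xslt, fake_binary) for r in streams["new"]]
```

Passing the converters in as functions keeps the sketch independent of any particular XSLT processor or MARC library.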

In the case of party records, new, updated and deleted records are received by the 
Harvester (already converted via stylesheet processes to the MusicAustralia party 
schema) and passed directly to the MusicAustralia party database, which must di- 
rectly manage functions such as match, merge and error identification. It is possible 
that a more generalised, intermediate party database may be developed in the future, 
or that batch updating of the NBD Name Authority File will eventually be supported. 

The processes described above relate to batch contribution of data via the Har- 
vester. An online update and review component for the Harvester is currently being 
developed. This will allow data to be input or edited via a Web form, reviewed by 
repository administrators (or not, depending on business needs) and approved for 
passing to the NBD or another ‘downstream’ repository. 

It should be noted that the decision to pass resource data through the NBD does not 
impose any additional requirements on non-MARC MusicAustralia contributors. To 
contribute to MusicAustralia, contributors must be able to extract data from their own 
collection management systems, convert that data to an agreed exchange format (in 
this case, MODS) and present the data for harvesting. Each MusicAustralia contribu- 
tor is free to make its own decisions about what and how much of its data it wishes to 
expose. This includes the freedom to decide to expose all Australian music records, or 
only those that are associated with digital content. 



4.2 Extracting and Storing Resource Metadata 

MARC records are extracted from the NBD to MusicAustralia based on a combina- 
tion of criteria: material type, ‘Australian content’ flags, music-related classification 
numbers, physical description, and subject headings. Following the initial extraction 
of records from the NBD, MusicAustralia regularly polls the NBD for new and 
changed records. The MARC records extracted from the NBD are converted to 
MARCXML using a Library of Congress XSLT stylesheet and stored in the Musi- 
cAustralia resource database on the National Library’s TeraText [14] platform. 
MARCXML was chosen as the storage format for the resource database because it is 
consistent with the work being done in the Kinetica redevelopment and is likely to 
simplify the extraction process from the NBD once this redevelopment is completed. 

While the Library of Congress had done considerable work on conversion from 
MARC to MODS, the National Library was the first institution to use MODS as an 
exchange format to incorporate non-MARC data into a MARC database. Various 
modifications to the MODS schema and to the MODS to MARCXML stylesheet were 
identified during the development of MusicAustralia. This process required a consid- 
erable investment of time and intellectual resources, but the lessons learned - and the 
opportunity to contribute to MODS schema and stylesheet development - outweighed 
the costs. The National Library now has a sound understanding of MODS, is confi- 
dent that it is an appropriate schema for data exchange and storage, and has the skills 
to assist contributors to convert and present their data in this schema. 



5 Service Architecture: Digital Objects 

While the initial focus of the development of the MusicAustralia service has been on 
resource discovery, a framework for the delivery of digital objects has also been es- 
tablished. As with the PictureAustralia service, the key principle is that the digital 
objects themselves are stored on the home institution’s server, with a link from the 
MusicAustralia metadata record. But music poses a series of different challenges. In 
the first place, PictureAustralia delivers a single object type - the image. MusicAus- 
tralia delivers digitised sheet music and larger scores, sound recordings, born digital 
scores and MIDI files, pictorial images, manuscripts, multimedia and text (books and 
theses). The single images in PictureAustralia are relatively simple to deliver, 
but MusicAustralia’s digital objects are not so simple. Digitised scores and 
manuscripts contain multiple images which must connect to form a whole. Born-digital 
scores pose significant challenges arising from the different forms of proprietary 
software used to create them. Sound recordings also raise a number of significant 
delivery issues, including whether to stream or not, how to integrate digital audio with 
meaningful bibliographic information, and how to deliver the files at resolutions 
which make for good listening but can still be available to the majority of Australians 
who do not have access to broadband services. 

Digitisation of music has proceeded slowly in Australia to date. By mid-2004, 
about 7,500 Australian scores were available online. More than 6,000 of these come 
from the National Library’s collections. The remainder come from five State libraries, 
and were digitized as part of a cooperative project supported by the National Library. 
Several thousand sound recordings have been digitized by ScreenSound Australia, the 
National Library, and some Australian universities. But only a fraction of these are 
currently available online, as a result of the complexities of clearing copyright and 
delivering derivatives of preservation masters for public access. Negotiations are in 
progress with rights management organizations to permit the delivery of in-copyright 
materials under a blanket licence. 

Links to digital audio files are invoked from icons appearing in the MusicAustralia 
metadata record. MusicAustralia does not prescribe how digital objects are stored, 
managed and delivered on participants’ servers, nor the metadata schema used to 
record information about the digital objects locally. But it does provide advice and 
guidelines for digitising audio files, both for preservation purposes and for online 
delivery. These cover such issues as file types, connection and access methods, and 
quality. The guidelines for online delivery are aimed especially at enabling optimum 
access by a wide range of users across different types of network connections. They 
suggest that contributing institutions consider delivering audio files in two different 
formats, so that users can choose whichever is more appropriate for them. These may 
be of mid-range quality (for broadband connections) or low quality (for modem 
connections), depending on bit rates. Whether to provide downloadable MP3 files or 
streaming files is largely a matter of local I.T. resources, since the latter require 
streaming server software. The National Library itself provides two streaming 
formats: RealAudio and QuickTime (at a minimum bit rate of 24 kbps). 

Digitised scores are delivered initially as thumbnail images retrieved from the par- 
ticipant’s server and embedded in the MusicAustralia metadata record. MusicAustra- 
lia prescribes the size and format of these thumbnail images, but not of the full image 
which is displayed by clicking on the thumbnail. Thumbnail images should be 150 
pixels in their longest dimension, with the other dimension set to 150 pixels or less in 
order to maintain the aspect ratio of the image. The preferred format for these thumb- 
nails is JPEG. MusicAustralia recommends the creation of a ‘view copy’ and a larger 
‘examination copy’ of image files, as well as a PDF file containing a collated version 
of the entire item, for printing purposes. Satisfactorily delivering multi-page image 
files has been a major issue for the participants, with a variety of solutions being tried. 
New delivery software is likely to make this easier in the future. 
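
The thumbnail rule above (150 pixels in the longest dimension, aspect ratio preserved) amounts to a small calculation. The sketch below is illustrative; the function name and the rounding choice are assumptions, not part of the MusicAustralia guidelines.

```python
def thumbnail_size(width, height, longest=150):
    """Scale (width, height) so the longest side equals `longest` pixels.

    The other side is scaled by the same factor, which keeps the aspect
    ratio and guarantees it does not exceed `longest`.
    """
    longest_side = max(width, height)
    new_width = max(1, round(width * longest / longest_side))
    new_height = max(1, round(height * longest / longest_side))
    return new_width, new_height


# A portrait-format page scan of 2000 x 3000 pixels.
print(thumbnail_size(2000, 3000))
```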

For text files, contributing organisations are encouraged to use a standardized 
format such as XML for the preservation copy, encoded using a schema such as EAD or 
TEI. A less satisfactory, but less resource-intensive alternative would be PDF. For 
online delivery, XML files should be converted to HTML for Web viewing, or 
transformed to PDF for printing or downloading the entire file. For ‘born digital’ scores, 
the preference is for online delivery in HTML format, created from within the original 
software package (such as Finale or Sibelius). These files will still require the installa- 
tion of an appropriate browser plug-in. A PDF version is recommended where the 
original software does not offer the facility for HTML conversion. 



6 Using MusicAustralia 

Users of MusicAustralia can search at two different levels: a simple search across all 
creator, title, subject and date fields, and an advanced search of specific fields. The 
simple search resembles a Google-type search, and is designed to provide an easy 
entry into the service. The advanced search is designed for more sophisticated users 
and more complex queries. It can be limited to specific item types or to the collections 
of specific institutions, and can also be restricted to records with digital objects 
attached. Standard Boolean operators can be used to combine searches of different 
fields. Browsing of works by title, creator or date is also possible, as is browsing 
people by name. 

Search results are presented as a brief summary of each item, together with a link 
to more detailed information and related items, a link to information about the holding 
institution, and a thumbnail or icon representing a digital object (if available). Related 
items will include links to audio recordings related to sheet music (and vice versa), 
different versions of a piece of sheet music, and different audio recordings of the 
same piece of music. Thumbnail images link to digitised sheet music, while appropriate 
icons link to audio files, digital scores, multimedia and digital texts. Audio files 
from a results set can be selected to form a playlist and played in sequence. 

A particular feature of MusicAustralia is its ability to enable simultaneous viewing 
of scores (or texts) and listening to audio files. Each item selected from the results 
summary will open a separate browser window. The user can choose a digitised score 
or text from the results summary, open a separate window for it, and then go back and 
choose an audio recording, which will appear in a third window. The audio file can 
then be played while viewing the score. Digital scores can also be combined into this 
process. 



7 MusicAustralia and the PASH Project 

In parallel with the development of MusicAustralia, a major project is underway to 
catalogue and digitise the contents of the Peter Burgis Performing Arts Archive, the 
most significant private collection of music and music-related resources in Australia, 
recently acquired by the University of Western Australia. [15] The Burgis Archive 
covers the work of Australian musicians and composers in all musical fields over the 
last hundred years. It encompasses Western art music, blues, jazz, country and west- 
ern, folk, pop and rock, ethnic music, indigenous music, music theatre, radio shows 
and advertisements, recordings of historic events and oral history. In all, it contains 
almost 200,000 individual items: 112,000 sound carriers and 72,000 print documents. 
The sound recordings include rare cylinder recordings (3,000), Edison discs (2,200) 
and 78s (64,000). The print library includes posters, photographs, concert pro- 
grammes, sheet music and biographical files. The value of the Archive for research 
into the history of musical composition and performance and the history of entertain- 
ment is immense. Cataloguing and digitising the Burgis Archive is the major compo- 
nent in Preserving Australia’s Sound Heritage (PASH), a national project which also 
covers the Australian Archive of Jewish Music held at Monash University. 

The Burgis Archive will form a major component of the national digital library of 
Australian music. It is essential that the service model and architecture developed for 
this collection are closely integrated with the overarching approach of MusicAustra- 
lia. The PASH project is uniquely placed to take advantage of the framework laid 
down by MusicAustralia, since no retrospective conversion of metadata is required 
and the resource discovery architecture for PASH is being developed from scratch. 

The PASH project is developing strategies in two main areas: resource discovery, 
and digital content management. As far as resource discovery is concerned, the over- 
riding strategic aim is to ensure that resource metadata from PASH can be integrated 
into MusicAustralia. The project is currently carrying out a comparative evaluation of 
MARC cataloguing and resource description based on metadata schemas like MODS. 
A major consideration is the level and nature of training and expertise required. 
On the one hand, MARC-based music cataloguing is particularly specialised and 
time-consuming; on the other, MODS-based resource description is a new field with 
little in the way of precedents or expertise. 

If the MODS approach is followed, PASH will need to establish its own method 
for exposing its metadata to the National Library’s Harvester service for inclusion in 
MusicAustralia. If MARC is used, records can either be created directly in the NBD 
using the Kinetica Cataloguing Client or created locally and uploaded using 
Kinetica’s Batch*Link service. One advantage of direct cataloguing would be the ability 
to reuse existing catalogue records from the NBD for items also held in other collec- 
tions, but the proportion of unique material in the Burgis Archive is relatively high. 

The extent to which the PASH project will create, contribute and use ‘party meta- 
data’ for composers’ and performers’ names is also being analysed. In large part, this 
is dependent on the decisions made about local creation and direct Web availability of 
resource metadata. 

Compatibility with MusicAustralia in digital content management is also an impor- 
tant consideration for the PASH project, though direct integration is not a requirement 
here. PASH intends to establish its own digital repository, initially to hold digitised 
versions of selected sound recordings, with priority in the digitisation process being 
given to the rarest and most fragile items in the Burgis Archive. The different delivery 
and management systems being used by MusicAustralia contributors for their digital 
objects are being evaluated by the PASH project to ensure that, as far as possible, it 
identifies and adopts national best practice in this field. MusicAustralia also specifies 
its own preferred formats for image and audio files, and the PASH project will aim to 
comply with these. 

One area of particular interest is the possible applicability of software solutions for 
institutional research repositories, such as DSpace and Fedora. These are currently the 
subject of national investigation by two projects funded by the Australian Research 
Information Infrastructure Committee (ARIIC): Australian Research Repositories 
Online to the World (Project ARROW) [16] and the Australian Partnership for Sus- 
tainable Repositories. [17] The National Library is a partner in both these projects. 
The PASH project is investigating the use of this kind of repository to store and de- 
liver digitised music files. 



8 Related Work 

As part of its planning for the MusicAustralia service, the National Library has moni- 
tored a range of international digital music initiatives. These have included digitisa- 
tion projects, Web gateways, and technical investigations. [18] 

Several significant music digitisation projects have been undertaken in North 
America. These have generally focused on sheet music, notably the Lester Levy and 
Duke University digital sheet-music collections, with their impressive descriptive 
analysis and depth of functionality. Canada’s Virtual Gramophone service is a major 
audio digitization project, focusing on 78s, while Jukebox was a European pilot pro- 
ject for access to distributed collections of archival audio files. 

Various music gateway services have also been developed in Europe and North 
America. They include collection-level descriptions (e.g. Cecilia), union catalogues 
(e.g. Ensemble and Music Libraries Online) and information about musical activities 
(e.g. European Musical Navigator). Other projects, such as Variations2 and 
WEDELMUSIC, have addressed the technical requirements for Web-based delivery 
of digital music objects, and associated issues in music information retrieval. 

MusicAustralia is unique in the way it combines various current strands of interna- 
tional digital initiatives in music. It combines a national framework, a union cata- 
logue, and innovative approaches to metadata harvesting and aggregation. It also 
provides a Web gateway to distributed collections of digital objects, representing 
various different musical formats - sheet music, audio files, multimedia, and digital 
scores. 



9 Conclusion 

MusicAustralia has laid the foundations for a national digital library of Australian 
music and has provided the basis for future infrastructure development in this key 
subject area. For closely related national initiatives like the PASH project, it is crucial 
to ensure that their strategies and outcomes are compatible with MusicAustralia, at the 
very least, and are integrated with it wherever possible. 

As with the national pictorial collection, the national music collection has tradi- 
tionally been split between sectors which have not shared information. Users have 
therefore been unable to use a single service to discover the location of music materi- 
als - even such closely related materials as a score and an associated recorded per- 
formance of an Australian work. In addressing this need, MusicAustralia has not only 
created an innovative resource discovery service; it is concurrently encouraging and 
supporting the creation of the national digital music collection. 

Its specific benefits and achievements already include: 

• Using the existing infrastructure of the NBD to handle contributor business proc- 
esses, both offline and online, as well as record de-duplication processes; 

• Reducing costs to MARC-based MusicAustralia contributors, by avoiding the need 
to support two separate processes for generating resource metadata; 

• Increasing the representation of Australian music records in the NBD by incorpo- 
rating records from organisations and sectors not currently contributing to it; 

• Testing and developing large-scale processes for harvesting, gathering and convert- 
ing data in a range of non-MARC formats. 

A key challenge is to broaden data contribution options and therefore the data con- 
tributor profile, while retaining data quality and consistency. Including records from 
organisations which may not have as much expertise as the MARC community in 
creating and managing descriptive metadata may mean some compromises on data 
quality. But such data quality issues are probably of more concern to libraries than 
they are to end users. End users, especially those who ‘discover’ materials through a 
specialist service such as MusicAustralia, tend to use simple search terms and their 
discovery needs can largely be met through relatively simple records. 

Several key goals for the further development of MusicAustralia in the medium term 
have been identified. They include: 

• Encouraging the creation and contribution of a greater level of digital content, 
especially through the selective digitization of score- and sound-based music re- 
sources. The PASH project will serve as an exemplar for this process. 

• Developing infrastructure to accept, manage and deliver knowledge annotations 
contributed by users of the service, and using these annotations to improve re- 
source descriptions. 

• Developing an authoritative directory of biographical and organizational informa- 
tion about Australian composers, performers and music organizations. 



• Providing seamless integration between the resource description, biographical and 
annotation components of the service. 

• Positioning the service to take advantage of likely developments in music informa- 
tion retrieval systems and the online delivery of commercial recordings. 

The launch of MusicAustralia in 2004 is a major cooperative achievement, and has 
provided a firm foundation for the continuing development of a national digital li- 
brary of Australian music. 
Abstract. MiDiLiB is a six-year research project on digital music 
libraries funded by the German Research Foundation (DFG) as a part 
of the Distributed Processing and Delivery of Digital Documents (V³D²) 
research initiative. MiDiLiB’s main focus is the development of content- 
based retrieval algorithms for both score- and waveform-based music. In 
this paper we give an overview of our research results, describe several 
prototypical systems for content-based music retrieval which have been 
developed during the project, and discuss applications of the presented 
techniques in the context of today’s and future digital music libraries. 



1 Introduction 

During the course of the development of digital libraries for non-textual or non- 
standard document types, the last five years have seen increasing efforts in the 
field of music libraries¹. Problems arising from the task of handling non-standard 
document types in digital libraries are manifold. A rough and non-exhaustive life- 
cycle of a non-standard document within a digital library may include the stages 
of digitization, choice of a suitable data format, transfer to and registration with 
the library. Furthermore, one has to deal with the issues of content-analysis 
(and annotation) as well as the generation of classical and content-based index- 
structures for efficient document access and retrieval. Finally, the creation of 
(multimodal) user interfaces, usage of system independent mechanisms for long- 
term document storage, and development of novel services for end-users to access 
the documents are of fundamental importance within a library scenario. 

Among those tasks, content-based document analysis and retrieval is one 
of the most challenging problems. In content-based document processing, raw 
data contained in a document are processed directly, rather than relying on 
secondary document descriptions such as annotated metadata related to the 
document. With regard to the huge existing collections of digital documents, 
efficient mechanisms for content-based document analysis are of fundamental 
importance, in particular as the manual creation of secondary document 
descriptions is generally unfeasible. In content-based retrieval, queries to 
document collections are processed based on suitable index structures derived 
from an automatic content analysis. 

¹ To avoid confusion, in this paper we shall use music as a general term comprising 
score- and digital waveform-based data. The term audio denotes digital 
waveform-based data like CD-audio or radio broadcast signals. 

One of the earliest and most intuitive tasks in content-based music retrieval 
is the name-that-tune application, where a user is interested in finding the 
title of a tune or a song which has been broadcast on the radio. Frequently, a 
listener is familiar with parts of the main theme or the hook line of the song 
although he might not remember the composer or interpreter. In order to find 
out this kind of information (metadata), one could call the radio station 
or, where available, consult the station’s playlist on the internet. In case 
none of these alternatives is available, a comfortable solution could be to hum 
or whistle the tune in question into a microphone and let a computer do the 
work of finding the desired information. For this purpose, the whistled tune is 
converted into a suitable sequence q of notes and then compared to a collection 
$m_1, \ldots, m_N$ of melodies which are used as a reference database. All melodies 
which are close to q with respect to a suitable distance measure are returned as 
query results. Music search based on note-representations is commonly referred 
to as score-based retrieval. One of the pioneering works in this field [1] is based on 
transforming query and database melodies into so-called down-up-repeat (DUR, 
also known as Parsons code) sequences, for roughly representing pitch intervals 
between subsequent notes. Such sequences provide a certain robustness against 
query errors with respect to musical intervals and absolute pitches of queried 
notes. The book of Barlow and Morgenstern [2] is one of the first content-based 
music dictionaries for manual search. It contains key-normalized pitch sequences 
representing the introductory parts of a large number of classical pieces. 
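
The down-up-repeat encoding described above can be illustrated with a short sketch. The use of MIDI note numbers as the pitch representation and the choice of example melody are assumptions made here for illustration; they are not taken from [1].

```python
def dur_sequence(pitches):
    """Return the down-up-repeat (DUR) string for a list of numeric pitches.

    Each symbol records only the direction of the interval between two
    subsequent notes: "U" (up), "D" (down) or "R" (repeat).
    """
    symbols = []
    for previous, current in zip(pitches, pitches[1:]):
        if current > previous:
            symbols.append("U")
        elif current < previous:
            symbols.append("D")
        else:
            symbols.append("R")
    return "".join(symbols)


# Opening of "Twinkle, Twinkle, Little Star" as MIDI note numbers (c = 60).
twinkle = [60, 60, 67, 67, 69, 69, 67, 65, 65, 64, 64, 62, 62, 60]
print(dur_sequence(twinkle))
```

Because only interval directions are kept, a query sung in a different key, or with slightly wrong interval sizes, yields the same DUR string, which is the robustness the text refers to.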

Although the previous example as well as the sketched solution sound intu- 
itive, real score-based retrieval scenarios impose several fundamental problems 
such as developing methods and user interfaces for query formulation, facilitating 
fault-tolerant queries, or devising index structures for efficient query processing. 
While the scenario of melody-based search allows us to use well-known data 
structures and algorithms from the field of text retrieval, the search in 
collections of complex, polyphonic musical scores requires data models and retrieval 
algorithms which are better adapted to music data. When the underlying music 
documents consist of audio signals such as CD-audio, content-based retrieval 
requires methods from digital signal processing. Important research topics in the 
field of content-based audio retrieval are audio identification, genre classification, 
and recommendation (“customers who bought song a also bought song b”). 

By now, the community of researchers specializing in music information 
retrieval (MIR) has grown to a considerable size, which is documented by the 
success of the annual International Conferences on Music Information Retrieval 
(ISMIR) bringing together music researchers, audio engineers, computer 
scientists, librarians, and the music industry [3]. The recently finished MiDiLiB-project, 
which has been funded by the German Research Foundation (DFG) as a part of 
the Distributed Processing and Delivery of Digital Documents (V³D²) research 
initiative, has contributed to several of the recent advances in content-based 
music retrieval. In this paper, we outline the technical concepts on music retrieval 
developed by the MiDiLiB-project. Besides developing techniques for content- 
based indexing and search of music documents this includes concepts for the 
important tasks of audio monitoring and synchronization of musical documents 
in different formats. We present several of our prototypical systems for efficient 
music retrieval and give an overview on our test results. Taking into account re- 
lated work, we outline current issues in music retrieval and discuss future trends 
in the area of digital music libraries. 

Concerning content-based music retrieval, MiDiLiB’s main contributions are 
the development of data structures and efficient, fault-tolerant techniques for 
polyphonic search in collections of polyphonic scores [4]. Besides allowing 
efficient search in polyphonic scores for the first time, a particular strength of the 
proposed techniques is a natural mechanism for finding and precisely localizing 
partial matches, the latter accounting for possibly incomplete user knowledge. 
Extensions of our technique have been successfully applied to the problems of 
fast audio identification [5]. As a generalization, we developed a technique for 
content-based search in large classes of multimedia documents including digital 
2D-images, shapes, and 3D-models [6]. 

The paper is organized as follows. In Sections 2-4, we give an overview on 
our techniques for content-based music retrieval on different types of music doc- 
uments. Section 2 deals with the task of searching in scores of polyphonic music. 
As a second task, in Section 3 we discuss melody-based retrieval, which may 
be considered as the monophonic version of the latter (general) score-based re- 
trieval. However, when allowing vague user queries such as melodies whistled into 
a microphone, special care has to be taken to develop suitable mechanisms for 
incorporating fault tolerance. Section 4 sketches how the developed techniques 
may be extended to search in large collections of audio signals. In particular, we 
describe a technique for identifying short excerpts of audio signals and present 
some recent applications. Finally, Section 5 discusses techniques for synchroniz- 
ing music documents given in different data formats (like, e.g., score- or signal- 
based formats) and outlines the importance of such algorithms in digital library 
scenarios. Concluding, Section 6 discusses possible applications of the retrieval 
techniques in the context of future digital music libraries. 



2 Score-Based Retrieval 

One of the key problems in searching scores of polyphonic music by content 
is to find appropriate models for representing the score data. In a classical ap- 
proach, which has already been sketched in the introduction, a sequence of notes 
is simply modeled as a string of symbols representing the notes’ pitches. For 
example, the melody Twinkle twinkle little star would be represented by the string 
ccggaagffeeddc. In a score-based retrieval scenario, a collection of N melodies 
would be modeled by a database of strings $m_1, \ldots, m_N$. Likewise, a user’s query 
is represented by a string q. For query processing, q would be compared to all 
melody strings $m_i$, using a suitable similarity measure d. A simple similarity 
measure is given by the edit distance $d(A, B)$ between two strings A and B, i.e., the 
minimum number of edit operations required to transform string A into string 
B. The three classical edit operations are insertion, deletion, and replacement of 
individual symbols. 
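
For reference, the edit distance with the three classical operations can be computed by the standard dynamic-programming recurrence. The sketch below is the textbook algorithm with unit costs, shown as an illustration rather than as the implementation used in any of the cited systems; the example strings are made up.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and replacements turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, symbol_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, symbol_b in enumerate(b, start=1):
            cost = 0 if symbol_a == symbol_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete symbol_a
                    current[j - 1] + 1,      # insert symbol_b
                    previous[j - 1] + cost,  # replace (or keep) the symbol
                )
            )
        previous = current
    return previous[-1]


# A query with one wrong note and one missing note, compared to a melody string.
print(edit_distance("ccgfaag", "ccggaag"), edit_distance("ccgaag", "ccggaag"))
```

Comparing a query against every melody string in this way costs time proportional to the product of the two string lengths per comparison, which is one reason index structures matter for large collections.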

It turns out that when considering polyphonic music, the classical string- 
based approach is no longer feasible. One of the main reasons is that simulta- 
neous notes cannot be modeled appropriately. Also note durations and rhyth- 
mic behaviour are not modeled by the above string-based representation. A 
main problem with the string-based approach is the treatment of the notes’ 
onset positions which are represented only implicitly, i.e., by their respective 
position within the melody string. In [4] we present a framework for modelling 
polyphonic music where note onset positions are made explicit. In this, a note 
is a pair $[t, p]$ consisting of a pitch $p$ and an onset position $t$. Then, a 
score-based music document may be easily modelled as a set of notes. For example, 
$D_1 := \{[10, c^1], [10, e^1], [14, c^2]\}$ is a music document consisting of three notes, 
two simultaneous notes of pitches $c^1$ and $e^1$ played at time position 10 and one 
note of pitch $c^2$ played at time 14. The set of all possible notes will hence be 
denoted by $U := \mathbb{Z} \times \{c^1, d^1, e^1, \ldots\}$, i.e., each note has an integer 
time-component ($\mathbb{Z}$) and a certain pitch ($c^1, d^1, e^1, \ldots$). We consider a 
database $D_1, \ldots, D_N$ where each document $D_i$ is simply a set of notes, i.e., 
$D_i \subseteq U$. Exact matches for a query $Q \subseteq U$ may then be easily defined by 
requiring that a time-shifted version $Q + t$ of $Q$ occurs in a document $D_i$. As a toy 
example, consider the query $Q := \{[5, c^1], [9, c^2]\}$. As 
$Q + 5 := \{[5 + 5, c^1], [9 + 5, c^2]\} = \{[10, c^1], [14, c^2]\}$ is 
contained in the above document $D_1$, the pair (5, 1) will be called a match, the 
number 1 specifying the matching document and the number 5 representing the 
time-lag required to shift the query to the matching position. 
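
The definition of an exact match can be sketched directly with Python sets. This is a naive illustration that scans every document, not the index-based algorithm of [4]; the function names and the string encoding of pitches ("c1", "e1", "c2") are assumptions of this example.

```python
def shift(query, t):
    """Return Q + t: every note of the query moved by t time units."""
    return {(onset + t, pitch) for onset, pitch in query}


def exact_matches(query, documents):
    """All pairs (t, i) with Q + t contained in D_i; documents are numbered from 1.

    Assumes a non-empty query. Candidate shifts are obtained by aligning the
    earliest query note with every document note of the same pitch.
    """
    q_onset, q_pitch = min(query)
    matches = []
    for i, doc in enumerate(documents, start=1):
        for onset, pitch in doc:
            if pitch == q_pitch:
                t = onset - q_onset
                if shift(query, t) <= doc:      # subset test
                    matches.append((t, i))
    return sorted(matches)


# The toy example from the text.
D1 = {(10, "c1"), (10, "e1"), (14, "c2")}
Q = {(5, "c1"), (9, "c2")}
print(exact_matches(Q, [D1]))   # [(5, 1)]
```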




Fig. 1. Query to a database in piano-roll notation (left) and excerpt of J.S. Bach’s
Fugue in C major, BWV 846, in which all occurrences of the query are highlighted (right).

The proposed idea may be easily extended to other types of matches. As an
example, assume that instead of using symbols c, d, e, ..., pitches are modelled
using integers 0, 1, 2, .... Then we could define a match to be a triple $(\tau, \pi, i)$
satisfying $Q + (\tau, \pi) \subseteq D_i$, i.e., the query Q time-shifted by $\tau$ and pitch-shifted
by $\pi$ semitones occurs in document $D_i$. Score-based music is typically visualized
using the so-called piano-roll notation, where each note is represented by a
rectangle located at position [t, p] corresponding to its pitch p and onset time t.
The width of a rectangle is proportional to the corresponding note’s duration.
As an example, Fig. 1 (left) shows the piano-roll representation of a small query
document. To the right, all pitch- and time-shifted occurrences of the query
within an excerpt of J.S. Bach’s Fugue in C major, BWV 846, are highlighted.
Further types of matches are necessary when considering fault-tolerant search.
An important example is that a few, say k, notes of a query Q do not match a
target position Q + t within document $D_i$. To account for this, we say that (t, i)
is a hit with k mismatches. Further types of fault tolerance, including a concept
of fuzzy notes as well as a mechanism for incorporating a user’s prior knowledge
on certain aspects of the desired document, are discussed in [4].
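
Both extensions, pitch shifting and tolerance of up to k mismatches, can be illustrated with a simple voting scheme: every pairing of a query note with a document note votes for the shift that maps one onto the other. This is a sketch under the assumption that pitches are integers (MIDI-like numbers are used here) and that k is smaller than the query length; it is not the algorithm of [4], and all names are hypothetical.

```python
from collections import Counter


def matches_with_mismatches(query, doc, k):
    """Return sorted triples (time_shift, pitch_shift, missing) for one document.

    A shift is reported if at most k notes of the shifted query are absent
    from the document. Assumes 0 <= k < len(query).
    """
    votes = Counter()
    for q_time, q_pitch in query:
        for d_time, d_pitch in doc:
            # This shift maps the query note exactly onto the document note.
            votes[(d_time - q_time, d_pitch - q_pitch)] += 1
    needed = len(query) - k
    return sorted((ts, ps, len(query) - hits)
                  for (ts, ps), hits in votes.items() if hits >= needed)


doc = {(10, 60), (10, 64), (14, 72), (20, 62), (24, 74)}
query = {(0, 60), (4, 72)}

print(matches_with_mismatches(query, doc, k=0))
# [(10, 0, 0), (20, 2, 0)]: one untransposed hit, one hit two semitones higher
```

Because both query and document are sets, a given shift maps each query note onto at most one document note, so the vote count equals the number of matching notes.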

Efficient index-based algorithms for the proposed types of matches have been
devised and successfully tested on a database of 12,000 pieces of music containing
about 33 million notes. The algorithms are based on a modified version of
inverted files, which are well-known from classical full-text retrieval. Intuitively,
for a given database $\mathcal{D} = (D_1, \ldots, D_N)$, one inverted file $H_{\mathcal{D}}(p)$ is created for
each pitch p. If a note [t, p] occurs in piece $D_i$, an object (t, i) is included in
the inverted file $H_{\mathcal{D}}(p)$. Together, the inverted files form an inverted index. In
our PROMS system [4], exact queries consisting of some 10-100 notes can be
answered in about 50 milliseconds on a Pentium II, 300 MHz PC. The underlying
database consists of music given in the popular score-like MIDI format [7].
Queries may be specified, e.g., by using an integrated piano-roll editor or by
recording a piece of music using a MIDI-piano connected to the system.
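
The inverted-file idea can be sketched as follows: one set of (t, i) entries per pitch, with exact matches obtained by intersecting the suitably shifted inverted files of all query notes. This is a minimal in-memory illustration; the actual PROMS data structures and optimizations are described in [4], and the names `build_index` and `query_index` are assumptions.

```python
from collections import defaultdict


def build_index(documents):
    """One inverted file per pitch: pitch -> set of (onset, document number)."""
    index = defaultdict(set)
    for i, doc in enumerate(documents, start=1):
        for t, p in doc:
            index[p].add((t, i))
    return index


def query_index(index, query):
    """Exact matches (shift, document number), found by intersecting shifted lists."""
    candidates = None
    for q_time, q_pitch in query:
        # Entry (t, i) supports the hypothesis "query shifted by t - q_time in D_i".
        shifted = {(t - q_time, i) for t, i in index.get(q_pitch, ())}
        candidates = shifted if candidates is None else candidates & shifted
        if not candidates:
            return []
    return sorted(candidates or [])


D1 = {(10, "c1"), (10, "e1"), (14, "c2")}
D2 = {(0, "c1"), (4, "c2"), (8, "g1")}
index = build_index([D1, D2])
print(query_index(index, {(5, "c1"), (9, "c2")}))   # [(-5, 2), (5, 1)]
```

Only the inverted files of pitches that occur in the query are touched, so the cost depends on the lengths of those lists rather than on the total size of the collection.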

We summarize some related work in the field of polyphonic score retrieval.
Lemström et al. recently considered several retrieval tasks using a data model
which is very similar to our approach [8]. Doraisamy et al. [9] model polyphonic
scores based on n-grams and use standard database techniques for query
processing. Pickens et al. [10] consider polyphonic score-based retrieval where the
query consists of an audio signal. For this purpose, the query is first transformed
into an approximate score version and then processed further. A very difficult
and as yet unsolved aspect of score-based retrieval is similarity-based search. Typke
et al. use the Earth Mover’s Distance in combination with a set-based data model
to incorporate a similarity measure within an efficient retrieval algorithm [11].



3 Melody-Based Retrieval 

In a melody-based retrieval scenario we assume that a melody, i.e., a monophonic 
sequence of notes, is used for querying a database of melodies. Typically, a 
melody-based retrieval system allows queries to be formulated by humming, 
singing, or whistling a tune into a microphone. Hence such a system is targeted 
to a much broader class of users than a system for polyphonic retrieval. 

The above set-based approach may be naturally extended to handle melody
queries. However, due to the differences in the general query scenario, much more
fault tolerance is needed to successfully process a query. First, the hummed or
whistled query is transformed into a sequence of notes using a suitable
extraction algorithm. Besides note extraction being an error-prone task in itself,
the user’s input may contain wrong notes, missing notes, deviations in rhythm
and tempo, or the query may be formulated in the wrong key. Taking into account
such sources of errors, our set-based approach to score retrieval has been
extended by a tempo-tracking mechanism as well as a technique to account for
missing notes. The tempo tracker is used to account for the typical tempo
variations occurring in many hummed or whistled queries. For this purpose, during
query processing each match candidate is assigned a tempo-tracking parameter.
In each step of the retrieval algorithm, a limited variation of this parameter
(w.r.t. the query’s tempo curve) is allowed; otherwise a match candidate is
excluded from further processing. In [12], we present the NWO system
(notify!-by-whistling online) for recognizing tunes whistled into a microphone. Fig. 2
shows an overview of the NWO system’s architecture. Following the extraction
step, the extracted notes are presented to the user in a piano-roll representation.
The user may then verify the query by acoustic playback and make corrections
to the extracted notes. Before actually issuing a query, the user is allowed to
incorporate prior knowledge about the desired query result by specifying
parameters like rhythm tolerance or the maximum number of missing notes. After
querying, the user is allowed to copy melody fragments of particular query results
and reuse them as new queries. Hence a special type of relevance feedback is
possible. An online version of NWO working on a manually compiled (and hence
limited) database of about 2,000 melodies is currently made available.



Fig. 2. Overview of the NWO system architecture.

It turns out [12] that our set-based approach with tempo tracking shows significant
advantages when users are able to remember rhythmic and harmonic details
of the query item. In such a case our search algorithms, in contrast to the classical
edit-distance-based approaches, retrieve only the few relevant database items
as query results. This holds even if a query consists of only a few, say 4-6, notes.
If, on the other hand, queries are of low quality, an edit-distance-based (or, more
generally, string-based) retrieval approach usually yields better results. However, this
comes at the expense of longer result lists and longer required query lengths.

In the field of melody-based retrieval, a significant amount of research has 
been done during the last decade. Besides the pioneering work carried out in the 
New Zealand Digital Library project [1] we only mention two recent contribu- 
tions. First, the Cuby-Hum system incorporates many of the essential technical 
aspects of a state-of-the-art query-by-humming system. The paper [13] is thus a 
good starting point for further reading. As a second aspect, a robust extraction of 
note events from user queries is crucial for obtaining high quality queries. In con- 
nection with the musicline.de database project, a query-by-humming technique 
has been developed which uses a physiological model for pitch extraction [14]. 



4 Audio Retrieval

In audio retrieval, rather than working on high level musical features such as 
notes, retrieval is performed based on the digital waveform signals of the under- 
lying music. An early approach for classifying sounds according to characteristic 
acoustic and perceptual features has been proposed by Wold et al. [15]. In this 
approach, short acoustic fragments constituting similar sounds are grouped to 
clusters, examples being clusters for laughter, scratchy sounds, or barking dogs. 

An important problem in audio retrieval is the identification of audio signals.
Given a large database of known audio signals, identification may be regarded as
a retrieval problem. Typically, a query signal will be a short and probably very
noisy excerpt of an original signal. An application scenario that has recently become
very popular consists of recording a part of an unknown song using a mobile phone. Such a
scenario could take place in a car, a restaurant, or some other noisy environment.
The recorded song is then transmitted to an identification agency. After
successful identification, the user is provided with the song’s title, composer,
performer, and possibly ordering information for the corresponding CD.




Fig. 3. Waveforms and feature representations extracted from a query signal q (top)
and a database signal $x_i$ (bottom). A t-shifted version of the query signal’s feature
representation F[q] occurs in $F[x_i]$.

A main problem in audio retrieval is the large volume of data which has to
be handled. However, it turns out that the set-based approach to score retrieval
presented in Section 2 may also be used to obtain efficient algorithms for audio
identification. Fig. 3 gives an overview of the underlying concepts, where the
basic idea consists of converting the huge number of sample values constituting
an audio signal to a so-called feature representation. Indexing and searching are
then performed based on those feature representations. Feature representations
are obtained using a so-called feature extractor F, which assigns class labels c
from a set X of available feature classes to signal positions t within an audio
signal. A feature $f = [t, c] \in \mathbb{Z} \times X$ is a pair consisting of a time position t
and a class label c assigned to this position. Fig. 3 shows a feature extractor
F extracting significant local maxima and minima from an input signal. In this
case, $X = \{\times, \bullet\}$, where feature class $\bullet$ denotes essential local maxima and class $\times$
denotes essential local minima. The figure shows a query signal q and a database
signal $x_i$ which are processed by F. The upper part shows the waveform signal
q and the feature representation F[q]. Note that the extracted features are
plotted at their respective sample positions. Hence, for the given signal q, the
feature representation F[q] contains three elements. The lower part of Fig. 3
shows the corresponding data for the signal $x_i$. To illustrate our concept of
feature-based identification, we note that a t-shifted version of the query signal q
occurs in the database signal $x_i$. This in turn is reflected by the t-shifted feature
representation F[q] matching a subset of the feature representation $F[x_i]$. As
compared to the above technique for score-based search, the notes [t, p] are just
replaced by features [t, c], hence making the full index-based retrieval technique
available. Thus for audio indexing, instead of creating inverted files for each
pitch, the search index consists of one inverted file for each feature class.
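
The following sketch mimics the situation of Fig. 3 on a toy signal. As a simplifying assumption, the feature extractor labels every strict local maximum and minimum, whereas the extractor in the text selects only significant extrema; the labels "max" and "min" stand in for the two feature classes, and all names are hypothetical.

```python
def extract_features(signal):
    """Toy feature extractor: label strict local maxima and minima of a sample list."""
    feats = set()
    for t in range(1, len(signal) - 1):
        if signal[t - 1] < signal[t] > signal[t + 1]:
            feats.add((t, "max"))
        elif signal[t - 1] > signal[t] < signal[t + 1]:
            feats.add((t, "min"))
    return feats


def find_shifts(query_feats, db_feats):
    """All shifts s such that the query features moved by s are a subset of db_feats."""
    if not query_feats:
        return []
    q_time, q_class = min(query_feats)
    shifts = []
    for t, c in db_feats:
        if c == q_class:
            s = t - q_time
            if {(qt + s, qc) for qt, qc in query_feats} <= db_feats:
                shifts.append(s)
    return sorted(shifts)


x = [0, 1, 0, 2, 5, 2, -3, 1, 4, 1, 0, 3, 0]   # database signal
q = x[3:10]                                     # query: excerpt starting at sample 3

print(sorted(extract_features(q)))   # [(1, 'max'), (3, 'min'), (5, 'max')]
print(find_shifts(extract_features(q), extract_features(x)))   # [3]
```

With real recordings the query is noisy, so an exact subset test of this kind would normally be combined with the fault-tolerant matching discussed for scores.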

To give an impression of resulting index sizes, we mention one of our test 
collections consisting of 15 GB of uncompressed audio which has been indexed 
using several different feature extractors. The resulting index sizes range from 33 
to 128 MB amounting to averages of 85-420 features per second. For more test 
results and detailed information on the various feature- and application-settings 
as well as our prototypical audentify! system, we refer to [16, 17].

As a second retrieval task we briefly discuss audio monitoring. Monitoring 
applications generally deal with the detection or inference of certain events from 
particular real-time data streams and have recently gained a great deal of atten- 
tion, particularly in the fields of databases and information retrieval. When mon- 
itoring real time audio streams such as broadcast channels (e.g., radio or TV), 
one is interested in detecting occurrences of known audio fragments or, more 
generally, particular acoustic events within those streams. To illustrate how au- 
dio monitoring may be performed using our retrieval technique, we assume that 
a collection of radio commercials is given as a database. In a preprocessing step, 
a search index is built from this collection as described above. During the process 
of monitoring, a computer receives a radio program via cable or network con- 
nection. Subsequently, the incoming audio signal is transformed into a feature 
representation. After a fixed number of features have been extracted, we use this 
so-called feature segment as a query to the search index. By storing the query 
results of each feature segment in an appropriate data structure, it is possible 
to precisely transcribe which commercial of the database has been broadcast at 
what time. Applications of such a scenario include the automatic creation of
advertising statistics or, using a search index built from a large collection of music
signals, the automatic generation of playlists. For a description of our prototypical
monitoring systems audentify!-live and Sentinel we refer to [5,18].
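
The monitoring loop described above can be sketched as follows. The sketch assumes that features have already been extracted from the stream, cuts them into consecutive segments of a fixed number of features, and tests each segment against every commercial directly instead of using an inverted index. Commercial names, feature data, and the function `monitor` are invented for illustration.

```python
def monitor(stream_feats, commercials, segment_size):
    """Return sorted (start_time, name) pairs for commercials detected in the stream.

    stream_feats: list of (time, feature_class), ordered by time.
    commercials:  dict mapping a name to its set of (time, feature_class) features.
    """
    log = set()
    for k in range(0, len(stream_feats) - segment_size + 1, segment_size):
        segment = stream_feats[k:k + segment_size]
        t0, c0 = segment[0]
        for name, feats in commercials.items():
            for c_time, c_class in feats:
                if c_class != c0:
                    continue
                start = t0 - c_time   # hypothesised stream time at which the commercial began
                if all((t - start, c) in feats for t, c in segment):
                    log.add((start, name))
    return sorted(log)


commercials = {
    "ad_a": {(0, "max"), (3, "min"), (7, "max"), (9, "min")},
    "ad_b": {(0, "min"), (2, "max"), (5, "min"), (6, "max")},
}
# Features of a stream in which ad_a starts at time 100 and ad_b at time 200.
stream = [(100, "max"), (103, "min"), (107, "max"), (109, "min"),
          (200, "min"), (202, "max"), (205, "min"), (206, "max")]

print(monitor(stream, commercials, segment_size=4))
# [(100, 'ad_a'), (200, 'ad_b')]
```

Short segments are more likely to produce spurious hits, and a segment that straddles two commercials matches neither; a practical system would use overlapping segments and merge consistent hits over time.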

We briefly summarize related work on audio retrieval. An early approach
to recognize distorted musical recordings was proposed in [19]. In the context
of large data collections, algorithms for robust audio identification include audio
hashing [20], geometric hashing-based approaches as proposed by Wang et
al. [21], hidden Markov models (HMMs) [22], and clustering-based approaches [23].
An overview of the proposed techniques may be found in [24]. Only recently,
audio identification services for the above mobile phone scenario have been
launched in several parts of Europe, e.g., the service offered by Shazam
Entertainment. Future work will also be concerned with using compact feature
representations of audio signals for exchanging content-based information [25].



5 Music Synchronization

Modern digital music libraries contain textual, visual, and audio data. Among 
this multi-media based information, musical data poses many problems, for mu- 
sical information is represented in many different data formats which, depending 
upon the application, fundamentally differ in their respective structure and con- 
tent. So far we have encountered two such data formats: the score data format 
and the digital waveform-based data format which we simply referred to as audio. 
Score data roughly describes music in a formal language depicted in a graphical- 
textual form, whereas audio data encodes all information needed to reproduce 
an acoustic realization of a specific musical interpretation. Other data formats 
such as MIDI may be thought of as a hybrid of the score and audio data format. 
In MIDI, relevant content-based information such as the notes of a score as well 
as agogic and dynamic niceties of a specific interpretation can be encoded. 

Hence, a musical work in the digital context is far from being unique since it
can have several different realizations in several different formats. This
heterogeneity makes content-based browsing and retrieval in digital music libraries
a challenging task. For example, one may think of a user who tries to find a
specific passage on some audio CD but only roughly knows the melody or only
remembers some score-based information such as a configuration of certain notes.

One important step towards a solution is the synchronization of multiple 
information sets related to a single piece of music. In the audio framework, 
by synchronization we denote a procedure which, for a given position in one 
representation of a piece of music, determines the corresponding position within 
another representation (e.g., the coordination of score symbols with audio data). 
Such linking structures could extend score-based music retrieval to facilitate
access to a suitable audio CD and could assist content-based retrieval in
heterogeneous digital music libraries - for example, allowing melody-based retrieval in
the audio scenario and vice versa. Furthermore, linking of score and audio data
could be useful for automatic tracking of the score position in a performance or
for tempo studies.

Within the MiDiLiB-project, we designed and implemented algorithms for the 
automatic synchronization of score-, MIDI- and audio-data streams represent- 
ing the same piece of music [26]. To align, for example, an audio data stream 
with a score data stream, we first extract score-like parameters such as onset 
times and pitches from the audio data stream. Then the actual alignment is 
computed based on the score-parameters by a technique similar to the classical 
dynamic time warping (DTW) approach. Only recently, two similar DTW-based 
synchronization algorithms have been proposed: Turetsky et al. [27] first convert 
the score-data stream into an audio-data stream using a suitable synthesizer 
and perform the alignment in the audio domain. Soulez et al. [28] use the score 
data to design a sequence of suitable filter models which can then be compared 
with the audio data stream. In contrast to these two approaches, we perform the 
synchronization purely in the score-like domain which has advantages in view of 
both efficiency and accuracy. 
However, due to the complexity and diversity of music data, the problem of
automatic music alignment is still far from being solved - not only concerning 
the data format but also concerning the genre (e.g., pop music, classical music, 
jazz), the instrumentation (e.g., orchestra, piano, drums, voice), and many other 
parameters (e.g., dynamics, tempo, or timbre). For the future it seems promising 
to devise a system incorporating multiple competing strategies (instead of relying 
on one single strategy) in combination with statistical methods as well as explicit 
instrument models in order to cope with the richness and variety of music. 
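
For readers unfamiliar with dynamic time warping, the following is a textbook DTW sketch operating on two pitch sequences, one taken from a score and one assumed to have been extracted from audio. It illustrates the classical technique only, not the synchronization algorithm of [26]; the cost function and the example data are assumptions.

```python
def dtw_align(score, audio, cost):
    """Classical DTW: return (total cost, warping path as list of index pairs)."""
    n, m = len(score), len(audio)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = cost(score[i - 1], audio[j - 1]) + min(
                D[i - 1][j - 1],   # advance in both sequences
                D[i - 1][j],       # advance in the score only
                D[i][j - 1])       # advance in the audio only
    # Backtrack from the end along cheapest predecessors.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1))
    path.reverse()
    return D[n][m], path


score_pitches = [60, 62, 64, 65]
audio_pitches = [60, 60, 62, 64, 64, 65]   # some notes detected twice
total, path = dtw_align(score_pitches, audio_pitches, lambda a, b: abs(a - b))
print(total, path)
# 0.0 [(0, 0), (0, 1), (1, 2), (2, 3), (2, 4), (3, 5)]
```

Each pair (i, j) in the path links score note i to extracted audio event j, which is exactly the kind of position correspondence that synchronization aims to provide.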

6 Conclusions and Future Work 

In this paper, we discussed recent advances in the field of digital music libraries.
We focused on the important aspect of content-based retrieval and gave an
overview of some of the techniques which have been developed in our
MiDiLiB-project. More precisely, we described a set-based technique for searching scores
of polyphonic music by content. Subsequently, we showed how this technique
may be exploited in melody-based retrieval, e.g., name-that-tune applications,
as well as in efficient audio identification and monitoring. The underlying general
technique is not restricted to the field of music retrieval, but may be used for
searching in general collections of multimedia documents by content [6]. Another
important aspect of the MiDiLiB-project is the development of algorithms for
synchronizing music given in different formats. We sketched an underlying technique
and pointed to several applications in the context of digital music libraries.

The last years have seen significant technological progress in content-based
music retrieval, where several retrieval tasks such as polyphonic score search and
audio identification became manageable on large-scale document
collections for the first time. Whereas complex score-based search is so far mostly restricted to
music experts, existing prototypes for melody search offer a sufficient degree of
fault tolerance to make them suitable for a broader class of users.

Based on the technological advances and the large existing collections of
music data, the next years will probably see an increasing number of publicly
available (online) music services. To conclude our paper we briefly sketch a
scenario of a client-server based service for real-time exchange of music-related
information. In our scenario, a modified audio player (serving as a client
application) receives a track’s lyrics during acoustic playback of that track
from a server application, possibly located within some library. The lyrics may
then be displayed synchronously with the actual acoustic playback. Such a service
may be realized using techniques for audio indexing, identification, and
audio-to-text synchronization. For this, an index of audio fingerprints is generated in
a preprocessing step. The audio fingerprints (consisting of suitable feature sets)
of a certain track are then suitably linked to the corresponding lyrics. During
playback of an audio track, fingerprints are extracted from that track. Those
are transmitted to the server and then used to determine the actual song and
playback position. Subsequently, the corresponding lyrics are transmitted to the
client application. Note that in the proposed approach one is not required to
make the actual audio tracks publicly available, but works on extracted fingerprints
only. Besides efficiency issues, this has significant advantages concerning
legal issues such as copyright and content protection.

Future work will be more and more concerned with developing the latter type 
of applications. However, in spite of the significant recent advances in the un- 
derlying technologies for content-based document processing, there is still much 
fundamental research work to be done in the field of semantic content analysis. 



Abstract. For the content-based management of and access to domain-specific
data in digital libraries, special domain knowledge and knowledge
processing functionality are required. However, the integration of
knowledge components has not yet become an integral part of existing
digital library systems. This paper presents the realization of
a digital archive of historical music scores, integrating
domain-specific data and functionality for writer identification in historical music
scores. We introduce the basic formalisms and heuristics for the representation
of handwriting characteristics. To compare two handwritings we
propose the use of a normalized, weighted Hamming distance function
to calculate the degree of similarity between their handwriting characteristics.
For the identification of writers we employ the k-nearest neighbor
method to build clusters of similar writers, based on the calculated distance.
Finally, we present and evaluate the test results from the
prototype implementation of the system.

1 Introduction 

A necessary step towards improving the usability of digital libraries and archives
is to provide specialized services to users with special needs in the management and
representation of documents and knowledge. Examples of such needs
can be found in numerous domains. In our work we elaborate the requirements
of a special user group, music scientists, for a specialized digital archive
of music scores. Historical music scores are valuable sources of information about
the circulation and practice of music in the past. Musicologists are concerned
with the extensive analysis of numerous copies of scores made by professional
scribes at different geographical locations. At the University of Rostock there
exist about 5000 handwritten music sheets from the 17th and 18th century.
The manual management and analysis of such a number of sheets and of the facts about
them is tedious work. Digital libraries organize the data in
a way that allows easier access and enhanced research possibilities for the historical
documents of this interesting collection.

Few existing digital libraries foster the integration of domain knowledge and
user-specific management of information. The digital library project “Perseus”
[1], for example, aims at providing specialized services for the needs of users of
cultural heritage digital libraries. One of the biggest digital library projects for
cultural heritage, ECHO (http://echo.mpiwg-berlin.mpg.de), as well as the New
Zealand digital library project Greenstone rely on specialized mechanisms for
content-based text annotation and retrieval supported by the integrated GATE
software system for human language processing (gate.ac.uk). Multimedia
digital libraries for video, audio, and images such as the Informedia Digital
Library (http://www.informedia.cs.cmu.edu) focus on annotating, indexing, and
retrieval of multimedia content for the extraction of “digested” data. However,
specialized services for domain-specific data have not yet become an integral
part of digital library systems.

In this paper we present some results achieved in the ongoing project
“eNoteHistory”. The project aims to build a digital archive integrating
knowledge components and specialized functions for writer identification in
historical music scores. We have realized the modeling, storage, and retrieval of the
digital documents and their metadata using existing methods from document
management systems [2]. In this communication we focus on the integration of
domain-specific data and methods for the scenario described in Section 2.
Section 3 contains a formal concept of the knowledge data structure and methods.
Implementation details and an example are provided in Section 4. Sections
5 and 6 conclude with extensive evaluations and a general assessment of the
system.

2 A Writer Identification Scenario 

We accomplished several steps in the process of developing a specialized
archive for the identification of scribes in music scores. After digitizing
documents and bibliographical metadata, we defined a data model for structuring
and linking the digital information. The mapping of the unstructured information
onto the data model was realized with an adequate preprocessing and import
procedure. Having accomplished this step, we could provide a couple of simple
search and browsing possibilities in the digital collection of music scores. The
next step was gathering and representing domain-specific knowledge to be used
for the analysis of handwriting characteristics in the music scores. Finally,
we formalized and implemented methods for the comparison and classification of
the handwriting features, to make special queries for the identification of scribes
possible.

Identifying and gathering relevant and accurate knowledge is a task which
we undertook in cooperation with domain experts, who are best aware of which
features are necessary for recognizing a handwriting. They also know well which
relations and dependencies exist between these features. Thus we created a
Knowledge Base consisting of two groups of data. On the one hand, we have the
descriptions and data structures for representing the domain of handwriting
features. On the other hand, we have measures and evaluations to traverse and
interpret the data structure, as well as relations and dependencies between the
elements of the data structure. These data are our abstract knowledge about the
collection of music documents. We need to represent each document in terms of
this abstract knowledge in the form of handwriting feature descriptions.



Fig. 1. Creation of Document Features.



Figure 1 illustrates the extraction and clustering of features from the digital- 
ized documents. From an abstract point of view the extraction process represents 
the mapping of documents into the feature space, thus allowing us to search in 
the knowledge base for documents, based on their handwriting features. Using 
the features extracted from the music scores and the distance information in 
the knowledge base we could classify the scores according to their handwrit- 
ing characteristics. The result is a set of clusters of handwriting characteristics, 
where each cluster in the best case represents exactly one scribe. It should be 
also possible to classify new handwritten music sheets by trying to find a cluster 
containing similar scores or by creating a new cluster if similar scores cannot be 
found. 

To determine appropriate classification methods we analyzed different
classification and clustering techniques [3-5]. Inductive methods, such as 1-rule, ID3,
C4.5, and SVM could not be applied for the classification of music handwriting
features for several reasons: the similarity values between the features
do not have a conventional metric (a special similarity measure is needed); the
feature values are nominally scaled so that, e.g., no spanning of a space for SVMs
is possible; integration of external knowledge is required, e.g., feature priorities;
overfitting and one-shot learning. To cope with these restrictions, we use the
instance-based k-nearest neighbor method [6], which classifies all existing
feature instances and learns with each new instance. Any similarity measure can
be used in the k-nearest neighbor method to calculate the similarity between
features. The k-nearest neighbor method has already been used in other writer
recognition projects, e.g.: a project from the State University of New York [7]
for writer identification based on 62 symbols subdivided into so-called micro and
macro features; and an information retrieval-based writer identification from
the University of Rouen (France) [8] for handwriting recognition using patterns
(writer’s invariants) for recognizing handwritten graphemes.
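
To make the combination of k-nearest neighbor classification with a custom distance on nominal values concrete, the following sketch uses a normalized, weighted Hamming distance over flat feature vectors. It is a simplified illustration: the feature values, weights, and scribe labels are invented, and the hierarchical similarity between feature values used in the actual system is not modelled.

```python
from collections import Counter


def weighted_hamming(u, v, weights):
    """Normalised weighted Hamming distance in [0, 1] between two nominal vectors."""
    differing = sum(w for a, b, w in zip(u, v, weights) if a != b)
    return differing / sum(weights)


def knn_classify(sample, labelled, weights, k=3):
    """Majority label among the k labelled vectors closest to the sample."""
    nearest = sorted(labelled,
                     key=lambda item: weighted_hamming(sample, item[0], weights))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical feature values in dot-separated notation, one per feature.
labelled = [
    (("4.1.1.1", "1.2", "7.3"), "scribe_A"),
    (("4.1.1.1", "1.2", "7.1"), "scribe_A"),
    (("4.1.1.2", "1.1", "7.3"), "scribe_B"),
    (("4.1.1.2", "1.1", "7.2"), "scribe_B"),
]
weights = [2, 1, 1]          # assumed feature priorities
sample = ("4.1.1.1", "1.1", "7.3")

print(knn_classify(sample, labelled, weights, k=3))   # scribe_A
```

Since the method only needs pairwise distances, nominal values pose no difficulty, and feature priorities enter simply as weights.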

3 Feature Base: The Knowledge 

About 80 handwriting features in historical music scores have been defined as
relevant for writer identification by the cooperating group of music scientists.
These features are categorized in 13 feature groups: clefs, slant, note stems, note
flags, note beams, accidentals, note heads, time signatures, bar lines, note beams
offset, rests, writing habits, staves. Each feature group represents a hierarchy of
detailed characteristics, which build up the concrete features. In this way the
different kinds of note flags, for example eighth and sixteenth note flags, are
grouped together as shown in Figure 2. Each of these two features is further
subdivided into note flags of notes with ascending and descending stems. The
node enumeration in the tree hierarchy, represented in a dot-separated notation,
e.g., “4.1.1.”, is used for the identification of the features. We refer to the set of
all feature hierarchies as the Feature Base. The Feature Base contains not only
the features themselves, but also the possible values for each feature. The set
of possible values is also organized hierarchically, and represents an extension of
the feature hierarchy as shown in Figure 2. The feature values are also identified
with the dot-separated notation. This organization aims to facilitate the
manual handwriting feature analysis. To determine the feature values for an
analyzed music score, one has to navigate through the tree-like structure. The
deeper in the hierarchy the feature value is determined, the more precise is the
feature description. However, it is not always necessary to choose a value from
the leaves of the tree. The analysis may stop on a higher level.
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Fig. 2. Note Flags Feature Hierarchy. 



3.1 Feature Base — Hierarchical Model 

A set of feature values representing the characteristics of a handwriting is a
feature vector $\gamma$. The feature vector contains values for the features and
has the form $\gamma = (v_1, \ldots, v_n)^T$, where $v_i$ is a value from the value
range $W_i$ of the feature $f_i$. The set of all feature vectors is
$\Gamma = \{\gamma \mid \gamma = (v_1, \ldots, v_n)^T, \forall i : v_i \in W_i\}$.
The set of all features is $F = \{f \mid f \text{ is a feature}\}$.

The structure of the Feature Base is realized as an acyclic, directed graph
(tree): $FB = (V, E, \mu, \nu, \tau)$, where $V = \{v_1, \ldots, v_n\}$ is a set of
nodes and $E \subseteq V \times V$ a set of directed edges, which connect the nodes.
All feature groups are represented in the same tree, thus a common root for the
Feature Base is defined. Each node has a description $\mu$. Furthermore, each node
has a sequential number, which distinguishes it from its neighbor nodes on the same
level: $\nu : V \to \mathbb{N}$. The function $\tau$ is defined as
$\tau : V \to Type$. This function assigns each node of the tree a value from the
set $Type = \{$‘prefix’, ‘value’, ‘feature’$\}$, where (reading $A_x$ as the set of
nodes in the subtree below $x$):

$\forall x \in V : \tau(x) = $ ‘feature’ $\Rightarrow \forall y \in A_x : \tau(y) = $ ‘value’; and
$\forall x \in V : \exists y \in A_x : \tau(y) = $ ‘feature’ $\Rightarrow \tau(x) = $ ‘prefix’.

A node $x$ with $\tau(x) = $ ‘feature’ is called a feature. The underlying partial
tree is the value range of this feature. Each node in the underlying tree has the
type ‘value’. Thereafter, we can refine the definitions for the set of features $F$
and the set of values $W_i$: $F = \{f \mid f \in V \land \tau(f) = $ ‘feature’$\}$ and
$W_i = A_{f_i}$ for all $f_i \in F$.

The path in a tree represents a unique identifier for a node: $PATH : V \to P$.
$P$ is a set of paths which are built in the following way. A path $PATH(x)$ for
a node $x$ is calculated by following the nodes from the root node to the node $x$.
$PATH(x)$ is a sequence of the numbers of each of these nodes in the order from
root to the specified node with a separating point between each number.
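The tree model and the PATH function can be illustrated with a minimal Python sketch. It is not the prototype's implementation (which lives in a DB2 schema); the class name `Node`, the helper names, and the example labels are assumptions made for illustration, as is the choice to leave the common root out of the path.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """One node of the Feature Base tree (illustrative names only)."""
    number: int                      # nu: sequential number among siblings
    description: str                 # mu: textual description of the node
    node_type: str                   # tau: 'prefix', 'feature' or 'value'
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, number: int, description: str, node_type: str) -> "Node":
        """Create a child node and attach it to this node."""
        child = Node(number, description, node_type, parent=self)
        self.children.append(child)
        return child


def path(node: Node) -> str:
    """PATH(x): sibling numbers from the root down to x, dot-separated.

    Assumption: the common root itself contributes no number.
    """
    numbers = []
    while node.parent is not None:
        numbers.append(str(node.number))
        node = node.parent
    return ".".join(reversed(numbers))


def value_range(feature: Node) -> List[Node]:
    """W_i: all nodes of the subtree below a node of type 'feature'."""
    result, stack = [], list(feature.children)
    while stack:
        current = stack.pop()
        result.append(current)
        stack.extend(current.children)
    return result


# Hypothetical fragment of the note-flag hierarchy.
root = Node(0, "Feature Base", "prefix")
flags = root.add(4, "note flags", "prefix")
eighth = flags.add(1, "eighth note flag", "feature")
ascending = eighth.add(1, "ascending stem", "value")
descending = eighth.add(2, "descending stem", "value")
```

With this fragment, `path(ascending)` yields the dot-separated identifier of the value node, and `value_range(eighth)` collects the nodes that form the value range of the feature.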



3.2 Distance Measure 



Two handwritings are similar if the distance between their feature vectors
$\gamma_a$ and $\gamma_b$ is small. Therefore, we need a distance measure of the type
$d_\Gamma : \Gamma \times \Gamma \to [0 \ldots 1]$ to compare the two vectors. The
normalized weighted Hamming distance function proved to return the best results
(see tests in section 5.2). Thereafter, for $\gamma_a = (v_1^a, \ldots, v_n^a)^T$ and
$\gamma_b = (v_1^b, \ldots, v_n^b)^T$ we use:

$$d_\Gamma = \frac{\sum_i w_i \cdot d_{f_i}}{\sum_i w_i},$$

where $d_{f_i}$ is the distance between $v_i^a$ and $v_i^b$ and $w_i$ is the weight of
the feature $f_i$. The distance between each pair of features $f_i$ from the feature
vectors is represented by $d_{f_i} : W_i \times W_i \to [0..1]$, where $W_i$ is the
value range of the feature $f_i$. The distance function $d_{f_i}$ satisfies the
following conditions:

(a) $\forall x, y \in W_i : d(x, y) \geq 0$, with $d(x, y) = 0$ only if $x = y$;

(b) $\forall x, y \in W_i$ where $x$ and $y$ have the same path length from the root:
$\forall x' \in A_x : d_{f_i}(x', y) \geq d_{f_i}(x, y)$ and
$\forall y' \in A_y : d_{f_i}(x, y') \geq d_{f_i}(x, y)$;

(c) The distance function is not symmetrical: in general, $d(x, y) \neq d(y, x)$.
This means that if a typical handwriting feature of a writer is to put a curl at
the end of a music element line, it is more probable that the same writer omits
the curl in some cases than that a writer who usually does not use curls decides
to put a curl at the end of a clef line; and

(d) The triangle inequality is also not valid, because the categorical values do
not have a continuous order: $d(x, y) \leq d(x, z) + d(z, y)$ need not hold for
$x, y, z \in W_i$.
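As a hedged illustration of the normalized weighted Hamming distance, the following Python sketch combines per-feature distances into one vector distance. The function names and the example values are assumptions; the per-feature distance used here is a plain 0/1 comparison chosen only to keep the example small.

```python
from typing import Callable, Sequence

FeatureDistance = Callable[[str, str], float]


def zero_one(x: str, y: str) -> float:
    """Simplest per-feature distance: 0 for equal values, 1 otherwise."""
    return 0.0 if x == y else 1.0


def hamming_distance(
    a: Sequence[str],
    b: Sequence[str],
    weights: Sequence[float],
    feature_distances: Sequence[FeatureDistance],
) -> float:
    """Normalized weighted Hamming distance d_Gamma over two feature vectors.

    Each feature i contributes w_i * d_fi(a_i, b_i); the sum is divided by
    the sum of all weights, so the result stays in [0, 1] as long as every
    per-feature distance does.
    """
    if not (len(a) == len(b) == len(weights) == len(feature_distances)):
        raise ValueError("all arguments must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("the weights must sum to a positive number")
    weighted = sum(
        w * d(x, y) for x, y, w, d in zip(a, b, weights, feature_distances)
    )
    return weighted / total_weight


# Made-up vectors of dot-separated feature values with made-up weights.
example = hamming_distance(
    ("1.1", "2.3", "4.1.1"),
    ("1.1", "2.4", "4.1.1"),
    (1.0, 2.0, 1.0),
    (zero_one, zero_one, zero_one),
)
```

Because the arguments are passed as `d(x, y)` in a fixed order, the sketch also works with asymmetric per-feature distances, which condition (c) explicitly allows.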

We distinguish three possibilities to measure, represent and interpret the
distance between single features.

(1) Comparison of values: The simplest solution is

$$d^{(1)}_{f_i}(x, y) = \begin{cases} 0 & \text{if } x = y \\ 1 & \text{otherwise.} \end{cases}$$

This distance metric is, however, inappropriate for most features because even
very small changes in the handwriting of the same scribe will be treated in the
same way as significant differences between the handwritings of different scribes.
Thereafter, a measure is needed which maps the similarities in the whole value
range between 0 and 1.

(2) Comparison of values based on value analysis: If the features are represented
as digit sequences (derived from the dot-separated notation), they can be split
into digits and thus compared. We can extract the digits of two feature values
$x_P = PATH(x)$ and $y_P = PATH(y)$ into $(x_P^1, \ldots, x_P^n)$ and
$(y_P^1, \ldots, y_P^n)$ respectively and compare each pair from the same hierarchy
level:

$$d^{(2)}_{f_i}(x, y) = \frac{1}{n} \sum_{j=1}^{n} d^{(1)}_{f_i}(x_P^j, y_P^j).$$

This distance function has an accuracy of $1/n$, where $n$ is the depth of the
value range tree. The fact that the function is still based on boolean logic does
not allow similarities between values from the same level to be found.

(3) Comparison of values based on value analysis and additional information: The
similarity between the feature values was defined heuristically by the music
scientists for each feature tree. For most of the features the similarities
between the feature values are known not only for a level but also for the whole
value range. We use a distance matrix to represent the similarities as scalar
values between each pair of feature values:

$$dist_{f_i} = \begin{pmatrix} d_{x_1,y_1} & \cdots & d_{x_1,y_n} \\ \vdots & \ddots & \vdots \\ d_{x_n,y_1} & \cdots & d_{x_n,y_n} \end{pmatrix}, \quad \forall j : x_j, y_j \in W_i.$$

Therefore, the distance function can be represented in the following way:
$d^{(3)}_{f_i}(x, y) = d_{x,y}$, with $d_{x,y}$ taken from $dist_{f_i}$ at position
$(x, y)$.

We use (3) to represent the distances between most of the feature groups.
However, there are some exceptions, which use metric (1) or a combination
of metrics (2) and (3).
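The three per-feature options can be sketched in Python as follows. This is an illustration under assumptions, not the prototype's code: the level-wise variant implements one plausible reading of option (2), namely the fraction of hierarchy levels at which the path digits differ, and the matrix entries and value names are invented.

```python
from typing import Dict, Tuple


def d_equal(x: str, y: str) -> float:
    """Option (1): 0 if both values are identical, 1 otherwise."""
    return 0.0 if x == y else 1.0


def d_levels(x: str, y: str, depth: int) -> float:
    """Option (2): compare dot-separated paths level by level.

    Returns the fraction of the `depth` hierarchy levels at which the two
    paths differ. A level that is missing in both paths counts as a match;
    a level missing in only one path counts as a mismatch (assumption).
    """
    xs, ys = x.split("."), y.split(".")
    mismatches = 0
    for level in range(depth):
        x_digit = xs[level] if level < len(xs) else None
        y_digit = ys[level] if level < len(ys) else None
        if x_digit != y_digit:
            mismatches += 1
    return mismatches / depth


def d_matrix(x: str, y: str, matrix: Dict[Tuple[str, str], float]) -> float:
    """Option (3): look up an expert-defined distance for the pair (x, y).

    The lookup is directional, so (x, y) and (y, x) may differ.
    """
    if x == y:
        return 0.0
    return matrix[(x, y)]


# Invented, asymmetric entries in the spirit of the curl example above.
curl_matrix = {
    ("curl", "no curl"): 0.3,
    ("no curl", "curl"): 0.7,
}
```

Keeping the matrix as a dictionary keyed by ordered pairs makes the asymmetry explicit: `d_matrix("curl", "no curl", curl_matrix)` and the reversed call return different values.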



3.3 Special Feature Values 

Null Values. We distinguish two types of null values, which can appear in the
handwriting feature vectors:

(1) Non-Information-Null (?) - there exists no information about a certain
feature value; e.g., in a particular music score the C-clef is not used, thus no
value can be defined for the writer. In this case a distance between two features
can be calculated only if both values are not equal to (?). Only those pairs of
features for which the distance can be calculated participate in the calculation
of the Hamming distance. We call the distance between two features, where none of
the values is equal to (?), a usable distance function: $d_{f_i}(v_1, v_2)$ is
usable $\Leftrightarrow v_1, v_2 \in W_i \land v_1 \neq\ ? \land v_2 \neq\ ?$. Then
the normalized, weighted Hamming distance will look like this:

$$d_\Gamma = \frac{\sum_{\forall i :\, d_{f_i} \text{ usable}} w_i \cdot d_{f_i}}{\sum_{\forall i :\, d_{f_i} \text{ usable}} w_i}.$$

(2) No-Applicable-Null (T) - a value for this feature will never exist for a writer.
This null value is regarded as an additional value in the feature value range, and
the distance between the no-applicable-null and all the other values of a feature
is the maximum distance between features.

Complex Values. Until now we have considered only the case in which exactly one
value can be assigned to each feature in a feature vector. However, in a more
general scenario we have to consider that more than one value could be assigned to
a feature: $\gamma_c = (V_1, \ldots, V_n)^T$, $\forall i \in \{1..n\} : V_i \subseteq W_i$.
Therefore, we defined a distance function for complex features:
$d_{f_i} : \mathfrak{P}(W_i) \times \mathfrak{P}(W_i) \to [0 \ldots 1]$, where
$\mathfrak{P}(W_i)$ is the power set of $W_i$. The function satisfies the condition:

$$\min_{v_1 \in V_1, v_2 \in V_2} d_{f_i}(v_1, v_2) \;\leq\; d_{f_i}(V_1, V_2) \;\leq\; \frac{\sum_{v_1 \in V_1, v_2 \in V_2} d_{f_i}(v_1, v_2)}{|V_1| \cdot |V_2|}.$$

We tested different possible distances (see section 5.2). It turns out that the
best value comparison results are achieved when the distance function value tends
to the minimum. The conditions concerning the non-information-null values also
have an effect on the calculation of the distance function for complex values. We
call a complex feature distance function usable when none of its arguments is
equal to (?): $d_{f_i}(V_1, V_2)$ is usable
$\Leftrightarrow V_1, V_2 \subseteq W_i \land V_1 \neq \emptyset \land V_2 \neq \emptyset$.
The final form of the Hamming distance is:

$$d_\Gamma = \frac{\sum_{\forall i :\, d_{f_i} \text{ usable}} w_i \cdot d_{f_i}}{\sum_{\forall i :\, d_{f_i} \text{ usable}} w_i}.$$
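The final form of the distance can be sketched as follows, assuming that each feature holds a set of values, that the non-information null is represented by the empty set, and that the set distance is the minimum over all value pairs (the variant reported as best for SF in section 5.2). The marker `NOT_APPLICABLE`, the maximum distance of 1.0, and all function names are assumptions for illustration.

```python
from typing import Callable, Optional, Sequence, Set

NOT_APPLICABLE = "T"  # hypothetical marker for the no-applicable null

FeatureDistance = Callable[[str, str], float]


def with_not_applicable(d: FeatureDistance) -> FeatureDistance:
    """Extend a per-feature distance by the no-applicable null.

    The null is treated as one more value with maximum distance (assumed
    to be 1.0) to every other value.
    """
    def wrapped(x: str, y: str) -> float:
        if x == y:
            return 0.0
        if NOT_APPLICABLE in (x, y):
            return 1.0
        return d(x, y)
    return wrapped


def set_distance(v1: Set[str], v2: Set[str], d: FeatureDistance) -> float:
    """Distance of two value sets: the minimum over all value pairs."""
    return min(d(x, y) for x in v1 for y in v2)


def hamming_distance_usable(
    a: Sequence[Set[str]],
    b: Sequence[Set[str]],
    weights: Sequence[float],
    distances: Sequence[FeatureDistance],
) -> Optional[float]:
    """Weighted Hamming distance restricted to the usable features.

    A feature is usable when both value sets are non-empty. Returns None
    if no feature is usable, because the quotient is then undefined.
    """
    numerator = denominator = 0.0
    for v1, v2, w, d in zip(a, b, weights, distances):
        if not v1 or not v2:
            continue  # non-information null on at least one side
        numerator += w * set_distance(v1, v2, d)
        denominator += w
    return numerator / denominator if denominator > 0 else None


d01 = with_not_applicable(lambda x, y: 0.0 if x == y else 1.0)
example = hamming_distance_usable(
    [{"1.2"}, {"2.1", "2.2"}, set()],
    [{"1.1"}, {"2.2"}, {"3.1"}],
    [1.0, 1.0, 2.0],
    [d01, d01, d01],
)
```

In the example the third feature is skipped because one side is empty, so only the first two features and their weights enter the quotient.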

4 The Implemented Prototype 

The implemented prototype 1 is based on an object-relational database manage- 
ment system (IBM DB2). The database consists of schemas for metadata and 
images of music scores, for the Feature Base, and for the handwriting feature 
vectors of writers. It also integrates functions and methods for distance deter- 
mination and clustering/classification of handwritings. The current version of 
the prototype has web interfaces for searching and navigating metadata and im- 
ages and an interface for recognizing similar writers, using a describing tool for 
handwriting features. 

4.1 Existing Tools 

We set the following requirements for the implementation of the prototype
application: (1) Support for instance-based classification, (2) Unlimited choice of
distance functions, (3) Unlimited adaptation of distance parameters, (4) Choice
of a distance matrix for each attribute, and (5) Support for null values and
multiple values. We could identify only three data mining tools which satisfy our
first criterion: Darwin, MLC++, and Weka.

Darwin [9] uses an instance-based classification method ‘Match Model’ which
is based on the k-nearest neighbor method. The value of k and the weights can
be modified. Null values are supported, using the data preselection possibilities.
However, multiple values and user-defined distance matrices are not supported.
Weka [4] has two classes, IB1 ((I)nstance (B)ased with k=1) and IBk, also based
on the k-nearest neighbor method. The value of k is modifiable, but the distance
function is not. Weights can only be determined by a training set. It is, though,
possible to specify a distance matrix. The source code is free and could be
used for a modified implementation. MLC++ [10] provides an abstract class
for classification. A subclass IB is an implementation of the method of Aha [11].
The feature weights for the classification are modifiable. Both MLC++ and Weka
could be used, but they would need significant modifications to adapt the standard
methods to our needs. Furthermore, considering the need to integrate the
methods into the database management system, we decided to reimplement the
methods, taking account of each of our requirements.

1 The prototype is available at http://www.enotehistory.de



4.2 Implementation in a Database Environment 

The Feature Base was implemented as a database schema in the IBM DB2
Database Management System (in principle, any other object-relational
database system could be used). The integration of the classification functionality
into the database environment was planned in order to provide a transparent
interface with short communication paths for the clients.

IBM DB2 supports the integration of methods by so-called user-defined functions
(UDFs) or stored procedures. UDFs are usable within SQL statements,
whereas stored procedures are only accessible by a special call, mostly from an
application.

We implemented two user-defined functions, $A_{FV}$ and $A_S$. $A_{FV}$ returns the
k-nearest feature vectors for a given query vector $\gamma_q$,
$A_{FV} = \{\gamma \in \Gamma \mid d_\Gamma(\gamma, \gamma_q) < t\}$,
using the distance function $d_\Gamma$. The threshold $t$ influences the number of
relevant results. A list of relevant scribes is returned by the $A_S$ function based
on $A_{FV}$. Both functions use a couple of interfaces to features and distance
matrices in the database. The result table includes a feature vector or a scribe
name with a similarity measure.
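The behavior of the two functions can be approximated outside the database by a short in-memory Python sketch. This is not the DB2 UDF code: the function names `a_fv` and `a_s`, the toy distance, and the stored vectors are invented for illustration, and the sketch assumes that each scribe is ranked by the smallest distance among that scribe's matching vectors.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[str, ...]
Stored = Sequence[Tuple[str, Vector]]  # (scribe name, feature vector)


def a_fv(
    query: Vector,
    stored: Stored,
    distance: Callable[[Vector, Vector], float],
    t: float,
) -> List[Tuple[float, str, Vector]]:
    """All stored vectors whose distance to the query is below t.

    Results are (distance, scribe, vector) triples, closest first.
    """
    hits = [(distance(vec, query), scribe, vec) for scribe, vec in stored]
    hits = [hit for hit in hits if hit[0] < t]
    return sorted(hits, key=lambda hit: hit[0])


def a_s(
    query: Vector,
    stored: Stored,
    distance: Callable[[Vector, Vector], float],
    t: float,
) -> List[Tuple[str, float]]:
    """Relevant scribes, each with the distance of the closest vector."""
    best = {}
    for dist, scribe, _ in a_fv(query, stored, distance, t):
        if scribe not in best or dist < best[scribe]:
            best[scribe] = dist
    return sorted(best.items(), key=lambda item: item[1])


def mismatch_ratio(a: Vector, b: Vector) -> float:
    """Toy distance: fraction of positions at which the vectors differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Invented archive content.
archive = [
    ("scribe_A", ("1", "2", "3")),
    ("scribe_A", ("1", "2", "4")),
    ("scribe_B", ("9", "9", "9")),
    ("scribe_C", ("1", "5", "3")),
]
scribes = a_s(("1", "2", "3"), archive, mismatch_ratio, t=0.5)
```

Lowering `t` shrinks the result set, which corresponds to the role of the threshold described above.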



4.3 Example 

The procedure from the recognition of features to the scribe identification is 
shown by the example in Figure 3. 

First, the music scientist has to recognize a characteristic notation of the
g-clef in the original music score. Then, he/she has to navigate through the
g-clef feature tree to find the most similar notation in the tree (in Fig. 3 it is a
g-clef with a closed and descending loop left of the g-point). This notation is a
possible value of the g-clef. The numerical value (...1.2.2.1.7) is needed for the
similarity determination which follows. To determine the similarity between two
different notations, we need the similarity matrices. If we compare the g-clef of
this example handwriting with another one with g-clef value ...1.2.2.1, we can
find the similarity of 0.6 in the g-clef matrix. For a convincing similarity between
both handwritings, at least 40% of the features have to be determined. Using
only the g-clef would lead to a low similarity value, because the g-clef is only
one of 80 possible features. Table 1 shows a result set of a query.
The result set includes the similarity, a quality measure, and the scribe’s name.
Scribe names are often fictitious, because the real name is not known.

Fig. 3. Example of a Semi-Automatic Feature Extraction and Writer-Identification
Procedure.



Table 1. A result set including similarity, quality measure (N_D, see the following
evaluation measures), and the writer’s name (_a and _b specify different periods of
a writer).

No   Similarity   Quality   Scribe
1.   0.018        0.534     AN305_a
2.   0.091        0.564     AN305_b
3.   0.126        0.548     Reichardt I_a
4.   0.192        0.394     J.M. Baldauf I_a
5.   0.214        0.434     Büchler IV_a



5 Evaluations 

To test our approach, we defined measures for evaluating the distance function, 
its parameters, the feature weights, and the query results. The objective was to 
achieve an optimal configuration of the current system which meets the require- 
ments of the music historical experts. 



5.1 Evaluation Measures 

Scoring Function: We use a scoring function (SF) for instance-based clustering
methods to evaluate the distance function and its parameters. SF is a measure of
the tightness of a cluster [3]. The distance between instances of the same cluster
has to be as small as possible, compared with distances between instances of
different clusters, in order to reach an optimal scoring function value:

$$SF = \frac{\sum_{(\gamma_1, \gamma_2) \in \Gamma_{\neq}} d_\Gamma(\gamma_1, \gamma_2) \,/\, |\Gamma_{\neq}|}{\sum_{(\gamma_1, \gamma_2) \in \Gamma_{=}} d_\Gamma(\gamma_1, \gamma_2) \,/\, |\Gamma_{=}|},$$

with
$\Gamma_{=} = \{(\gamma_1, \gamma_2) \mid \gamma_1 \neq \gamma_2,\ \gamma_1 \text{ and } \gamma_2 \text{ are instances of the same cluster}\}$, and
$\Gamma_{\neq} = \{(\gamma_1, \gamma_2) \mid \gamma_1 \neq \gamma_2,\ \gamma_1 \text{ and } \gamma_2 \text{ are instances of different clusters}\}$.
The aim is to maximize the SF measure.

In the current case we do not need to perform the clustering step because
we already have a set of predefined clusters of writers. Therefore, we can use
this measure to evaluate the distance function and the parameters directly. A
maximum value for the SF measure indicates a good choice of the distance
function and its parameters.
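A minimal Python sketch of the SF computation is given below, assuming that SF is the average distance between instances of different clusters divided by the average distance between instances of the same cluster, so that larger values are better. Ordered pairs are used because the distance need not be symmetric. The function name and the one-dimensional example data are assumptions.

```python
from itertools import permutations
from typing import Callable, Sequence, Tuple, TypeVar

V = TypeVar("V")


def scoring_function(
    instances: Sequence[Tuple[str, V]],
    distance: Callable[[V, V], float],
) -> float:
    """SF = mean inter-cluster distance / mean intra-cluster distance.

    `instances` holds (cluster label, vector) pairs. Every ordered pair of
    distinct instances is evaluated once.
    """
    same, different = [], []
    for (label_a, vec_a), (label_b, vec_b) in permutations(instances, 2):
        target = same if label_a == label_b else different
        target.append(distance(vec_a, vec_b))
    if not same or not different:
        raise ValueError("need pairs within and across clusters")
    intra = sum(same) / len(same)
    inter = sum(different) / len(different)
    return float("inf") if intra == 0 else inter / intra


# Made-up one-dimensional "vectors" in two writer clusters.
points = [("a", 0.0), ("a", 1.0), ("b", 10.0), ("b", 11.0)]
sf = scoring_function(points, lambda x, y: abs(x - y))
```

Comparing `sf` for different distance functions or weightings on the same predefined clusters is the kind of use described above; no clustering step is involved.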

Precision/Recall: The evaluation measures precision and recall [12] are
broadly used for evaluating information retrieval systems. We adopt these
measures to evaluate the developed system for writer identification queries. For
each query, $Precision = \frac{|A \cap B|}{|B|}$ and $Recall = \frac{|A \cap B|}{|A|}$,
where $A$ is the set of all relevant features of the query, and $B$ is the set of
features in the result set. Precision is a measure of quality and recall is a
measure of quantity regarding query results.

K (a kind of average precision): K is used to evaluate the result sets after
interpreting the best result of each set. The result of a query for identifying
a writer should contain the correct writer class. Precision/recall evaluates the
system based on information retrieval aspects: whether the best results have
the highest relevance and the last results the lowest relevance. In contrast,
K evaluates the system based on classification aspects: either the right
class is recognized or not. $K = \frac{X}{N}$, where $X$ is the number of all result
sets with correctly identified writers and $N$ is the number of all result sets.
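The three query measures can be written down in a few lines of Python. The sketch assumes that a result set counts as correct for K when its top-ranked entry is the true writer, which is one reading of the description above, and that an empty result set yields precision 0; both conventions and all names are assumptions.

```python
from typing import Hashable, List, Sequence, Set, Tuple


def precision_recall(
    relevant: Set[Hashable], retrieved: Set[Hashable]
) -> Tuple[float, float]:
    """Precision = |A n B| / |B| and recall = |A n B| / |A|.

    A is the set of relevant items, B the set of retrieved items. An empty
    denominator yields 0.0 (a convention chosen for this sketch).
    """
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def k_measure(result_sets: Sequence[Tuple[str, List[str]]]) -> float:
    """K = X / N over (true writer, ranked list of returned writers).

    X counts the result sets whose best-ranked writer is the true one.
    """
    if not result_sets:
        raise ValueError("at least one result set is required")
    correct = sum(
        1 for truth, ranked in result_sets if ranked and ranked[0] == truth
    )
    return correct / len(result_sets)


# Made-up identifiers of relevant and retrieved items.
p, r = precision_recall({1, 2, 3, 4}, {3, 4, 5})
k = k_measure([("a", ["a", "b"]), ("b", ["a"]), ("c", [])])
```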



Impact of Null Values ($N_F$, $N_D$): We defined two measures to evaluate the
impact of null values. The first measure, $N_F(\gamma)$, represents the relationship
in a feature vector between the weights of null values and the weights of all
values. The second measure, $N_D(d(\gamma_a, \gamma_b))$, defines the proportion of
features which are not used for the distance calculation due to a null value in
one or both of the vectors, i.e., the features $i$ with
$v_i^a = \text{null} \lor v_i^b = \text{null}$.

[Equations for $N_F$ and $N_D$: not recoverable from the source.]



5.2 Evaluation Interpretations 

We applied the evaluation measures in order to prove the reliability of the meth- 
ods and query results. Apart from the particular tests, some of them (distance 
function, distribution of feature weights, and distribution of feature values) were 
combined to consider possible interferences between the tests. 

Distance Function. Different distance functions, including the Hamming distance,
the Euclidean distance, and one higher-order distance metric, were tested. The
higher-order distance measure increased the influence of bigger differences
between single features on the overall feature vector distance. The Hamming
distance led to the best results according to the SF and the K measure and proved
to be significantly better than the Euclidean distance and the higher-order
distance.



Feature Relevance and Weights. Every feature participates with a certain weight
in the calculation of the overall distance between feature vectors. The feature
weights can be used to increase or decrease the impact of a feature on the result
of the distance function. We determined the weight of features by calculating
the distance between two feature vectors omitting the feature for which we want
to determine the weight. If the result of the distance function improved, the
omitted feature received a lower weight and vice versa. The next step was to
estimate the distribution of weights for each feature. We tested (1) a constant,
(2) a linear grouped, (3) a linear, (4) a quadratic grouped, and (5) a quadratic
distribution. The test results show that the SF improves after each test from (1)
to (5) and K improves in the opposite direction from (5) to (1). Therefore, we
chose to use the linear distribution as a compromise for a good SF and a good K.
This distribution also corresponds best with the expert knowledge of the music
scientists.



Optimization of Distance Matrices. The similarity values in the distance
matrices were heuristically determined by music scientists, without considering the
classification/clustering algorithms. For example, a value of 0.95 is very close to
1.0 and this distance could be overvalued. Therefore, we needed to check the
range of all values in a matrix by altering them with the following functions:
translation with $x - 0.2$, $x - 0.1$, $x + 0.1$, $x + 0.2$; scaling with $x/4$,
$x/2$, $x \cdot 2$; exponentiation with $x^2$, $x^3$, $\sqrt{x}$, $\sqrt[3]{x}$. The
results indicated that the preliminarily determined values were slightly too high.
Functions which decreased the values led to a better evaluation, but the
improvement is not significant. Thus, we can continue to use the original values.
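The matrix variations can be sketched as follows. The dictionary representation of a matrix and the names are assumptions, and so is the clamping of results to [0, 1]: the text does not say how values leaving the unit interval were handled, so clamping is only one possible choice.

```python
from typing import Callable, Dict, Tuple

Matrix = Dict[Tuple[str, str], float]

# The eleven variations listed above, keyed by a descriptive label.
TRANSFORMS: Dict[str, Callable[[float], float]] = {
    "x - 0.2": lambda x: x - 0.2,
    "x - 0.1": lambda x: x - 0.1,
    "x + 0.1": lambda x: x + 0.1,
    "x + 0.2": lambda x: x + 0.2,
    "x / 4": lambda x: x / 4,
    "x / 2": lambda x: x / 2,
    "x * 2": lambda x: x * 2,
    "x^2": lambda x: x ** 2,
    "x^3": lambda x: x ** 3,
    "sqrt(x)": lambda x: x ** 0.5,
    "cbrt(x)": lambda x: x ** (1.0 / 3.0),
}


def transform_matrix(matrix: Matrix, fn: Callable[[float], float]) -> Matrix:
    """Apply fn to every entry and clamp the result to [0, 1].

    Clamping is an assumption of this sketch, not taken from the paper.
    """
    return {pair: min(1.0, max(0.0, fn(d))) for pair, d in matrix.items()}


# Invented off-diagonal entries of one feature's distance matrix.
original = {("a", "b"): 0.95, ("b", "a"): 0.5}
shifted = transform_matrix(original, TRANSFORMS["x + 0.1"])
squared = transform_matrix(original, TRANSFORMS["x^2"])
```

Each transformed matrix would then be plugged into the distance function and scored with the evaluation measures, one variation at a time.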



Null Values and Multiple Values. The presumption that incorrect or false
distances between feature vectors are based on too many null values (low $N_D$)
could be validated. Correctly recognized writers have fewer null values in their
feature vector (high $N_F$). In the case of multiple values for a feature another
problem arises: which one of the possible values should be taken for the distance
calculation? The presumption is that the resulting distance should be between
the minimum and the average of the possible distances. We tested:

(1) the minimum $\min_{v_1 \in V_1, v_2 \in V_2} d_{f_i}(v_1, v_2)$,

(2) the maximum $\max_{v_1 \in V_1, v_2 \in V_2} d_{f_i}(v_1, v_2)$,

(3) the arithmetic average $\frac{\sum_{v_1 \in V_1, v_2 \in V_2} d_{f_i}(v_1, v_2)}{|V_1| \times |V_2|}$,

(4) the maximum in direction $\sqrt[k]{\frac{\sum_{v_1 \in V_1, v_2 \in V_2} d_{f_i}(v_1, v_2)^k}{|V_1| \times |V_2|}}$, and

(5) the minimum in direction $\left(\frac{\sum_{v_1 \in V_1, v_2 \in V_2} \sqrt[k]{d_{f_i}(v_1, v_2)}}{|V_1| \times |V_2|}\right)^k$.

The minimum method (1) results in the best SF values; the minimum in direction
(5) results in the best K values.
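The five aggregation variants can be compared on a list of pairwise distances with a few small Python functions. The exponent `k` is not stated in the text, so `k = 2` is used purely as an assumed default; the function names are also invented.

```python
from typing import Callable, List, Set


def pair_distances(
    v1: Set[str], v2: Set[str], d: Callable[[str, str], float]
) -> List[float]:
    """All pairwise distances between the values of two value sets."""
    return [d(x, y) for x in v1 for y in v2]


def agg_min(ds: List[float]) -> float:
    """(1) minimum of the pairwise distances."""
    return min(ds)


def agg_max(ds: List[float]) -> float:
    """(2) maximum of the pairwise distances."""
    return max(ds)


def agg_mean(ds: List[float]) -> float:
    """(3) arithmetic average of the pairwise distances."""
    return sum(ds) / len(ds)


def agg_toward_max(ds: List[float], k: float = 2.0) -> float:
    """(4) k-th root of the mean of k-th powers; lies above the average."""
    return (sum(x ** k for x in ds) / len(ds)) ** (1.0 / k)


def agg_toward_min(ds: List[float], k: float = 2.0) -> float:
    """(5) k-th power of the mean of k-th roots; lies below the average."""
    return (sum(x ** (1.0 / k) for x in ds) / len(ds)) ** k


sample = [0.0, 1.0]  # made-up pairwise distances
ordered = [
    agg_min(sample),
    agg_toward_min(sample),
    agg_mean(sample),
    agg_toward_max(sample),
    agg_max(sample),
]
```

For non-negative distances and `k > 1` the five results are ordered from minimum to maximum, so variants (1), (5) and (3) all respect the presumption that the set distance lies between the minimum and the average.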



Precision/Recall. We measured precision and recall for a set of queries with a
threshold between 0.5 and 0.01 (see Table 2). Each time, we chose one feature
set from the 150 test feature sets and queried the remaining feature sets with
it. The precision and recall values are the averages of the precision and recall
over all queries. The best precision/recall for a writer identification query is
about 85% precision and 75% recall.



Table 2. Test results (abridgment): Precision, Recall, K, and Null Values using a fixed
threshold value.

Threshold   Precision   Recall   K       N_D reg. right      N_D reg. false
Value t     PR          RE       K       class. instances    class. instances
0.50        0.201       0.918    0.382   0.392               0.357
0.40        0.256       0.918    0.382   0.392               0.357
0.30        0.335       0.911    0.427   0.388               0.349
0.20        0.668       0.766    0.674   0.403               0.332
0.10        0.883       0.550    0.787   0.403
0.01        0.898       0.093    0.461   0.481



6 Conclusions

The aim of the project “eNoteHistory” is the development of a system to support
music scientists in the study and the analysis of historical music documents.
Simple management of metadata and digital documents, e.g., in a digital catalog
system, was, however, not enough to satisfy their requirements. The need to
integrate special methods into the catalog system to enhance the possibilities for
data usage and retrieval led us to the definition of a specialized archive system.
We integrated methods and domain-knowledge components to enable
semi-automatic handwriting recognition for writers of music scores. The recognition
accuracy of this early prototype of the system is about 90%. The confidence of
this evaluation, however, is still not strong enough, because we have performed
our tests using only about 150 feature sets. Another project partner is working on
an automatic procedure for the recognition of features, which can reduce the
tedious manual work of feature analysis. This implies the future integration of
automatic feature extraction into the system. Thereafter, we hope to receive
more feature sets to give a better evaluation of the system. Nevertheless, we have
so far defined an extensive test environment and have learned much
about adapting and improving the performance of the system using different
parameters.

Abstract. Kepler is an attempt to bridge the gap between established,
organization-backed digital libraries and groups of researchers that wish to publish
their findings under their control, anytime, anywhere, yet have the advantages of an
OAI-compliant digital library. We describe an architecture and implementation
of the Kepler system that allows an archivelet to be installed in the order of
minutes by an author on a personal machine and a group server in less than an
hour. The group server will harvest from all archivelets and make the union of
all published papers available for search to a community. We describe how a
group administrator can provide an XML schema for the metadata and how the
Kepler engine will validate against it when an author publishes a paper and
completes the metadata. We have demonstrated that we can surmount the
technical difficulties so that authors can publish as easily as to a website yet
produce OAI-compliant digital libraries.



1 Introduction 

One of the largest obstacles for information dissemination to a user community is that
many digital libraries use different, proprietary technologies that inhibit
interoperability. The Open Archives Initiative (OAI) addresses interoperability by
using a framework to facilitate the discovery of content stored in distributed
archives through the use of the Open Archives Initiative Protocol for Metadata
Harvesting (OAI-PMH) [1]. Realizing the benefits of OAI, a number of communities are
interested in an out-of-the-box solution that will help them deploy OAI-based digital
libraries. However, building a communal digital library is currently severely hampered
by the lack of easy-to-use tools that address the diverse requirements of different
communities. In particular, metadata, as the codification of the worldviews that
define a community, needs to accommodate varying formats, uses, encodings and
pedigrees. Creating a system for communal digital libraries poses a number of
challenging research questions.

One of our initial efforts in this direction is Kepler [2], which gives publication
control to individual publishers, supports rapid dissemination, and addresses
interoperability. In Kepler, OAI-PMH is used to support “personal data providers” or
“archivelets”. Archivelets are meant to be “personal pocket libraries” that, on the
one hand, are OAI-PMH compliant data providers and, on the other hand, overcome the
reluctance of authors to publish into a digital library (instead of putting their work
on their website) through user-friendly publishing tools and total control, since all
files and metadata reside on the author’s personal machine. This latter characteristic
has serious implications for reliability, since these machines are not
necessarily always up or connected to the Internet (e.g., the author might keep the
archivelet on a laptop and be traveling or on a field trip). In our vision, individual
publishers can be integrated with an institutional repository like DSpace [3] via a
Kepler Group Digital Library (GDL). The GDL aggregates metadata and full text
from archivelets and can act as an OAI-compliant data provider for institutional
repositories.

As a demonstration, we provided an initial registration service and a service
provider at Old Dominion University. Once an archivelet registered with our
registration service, the service provider could harvest metadata from it. We faced a
number of issues during the initial deployment of Kepler: the software could not be
flexibly customized and deployed for a community, and archivelets were
often installed behind NATs (Network Address Translators), making it difficult for the
service provider to harvest them.

In this paper, we build upon our experiences with the initial Kepler distribution and
describe tools and software for groups within communities to deploy digital libraries
that are customized for their needs, easily populated, managed, and “open” for
development of future services. The main contributions of this paper are: (a) a
modular framework for defining, describing and supporting the
publication/dissemination requirements for different communities; (b) enhanced
packaged tools and software to create an out-of-the-box solution for deploying
communal digital libraries for diverse groups; and (c) a server-based publication
tool, for use behind firewalls, that mirrors the archivelets in terms of
functionality. The architecture of just the archivelet is summarized in [4]. At this
point we have not implemented the hierarchical aggregation of groups into
communities. There are a number of issues we are currently addressing and will report
on in future reports. For the test deployment, we are working with the US Geological
Survey (USGS), Los Alamos National Laboratory (LANL), and the Open Language
Archives Community (OLAC).

The rest of this paper is organized as follows: Section 2 presents the Kepler
architecture as it has been released in open source. The server-side archivelet is
presented in Section 3. In Section 4, we present the new features we have
added to the Kepler system, and Section 5 presents related work. We present future
work in Section 6.



2 Enhanced Archivelet 

The original archivelet implementation [2] only supported the Dublin Core (DC) format.
We remedied this problem by allowing an arbitrary number of formats to be realized
by one or more communities. We have re-engineered the architecture to be extensible
with regard to formats and functions (see Figure 2). The new design has a
well-defined API specification defining the various functions that are implemented by
every module and that in turn are available to other modules. Support for a new
metadata format requires just the implementation of a metadata driver module. In the
Kepler software documentation, we provide developer guidelines for fast and easy
implementation of a metadata driver. The metadata manager module is responsible for
instantiating the various metadata drivers the system is configured to use. It also
implements the OAI-PMH API that provides a method for each of the six OAI-PMH requests.
OAI-PMH requests received by the Webserver module are forwarded to the Driver
Manager, which decides what metadata drivers are involved and invokes these drivers to
get partial responses from each. The Driver Manager then constructs the whole
response from these partial responses. The Driver Manager also implements a User
Interface API. This API contains methods that are invoked in response to user
interactions with the main interface. For example, when the user clicks “publish”, the
Driver Manager brings up a simple GUI that allows the user to select which metadata
format she wants to use, and then the Driver Manager invokes the appropriate Driver to
display the appropriate Publishing tool (Figure 1). We refer to this tool as a
publishing tool rather than a metadata editor because it involves the infusion of the
full-text document into a repository in the local archivelet.




Fig. 1. Main Interface and OLAC Publishing Tool for Archivelets 



The metadata driver module implements the OAI-PMH processing and the user
interface functions, such as publishing tools, for the specific metadata format that
the Driver handles. The publishing interface (bottom left of Figure 1) has dynamic
field types (mandatory or optional), which are determined by a configuration file,
based on XML schema, managed by the group server administrator. The Driver invokes the
Validation module whenever new metadata is published to validate the metadata
against the constraints specified in the configuration file and uses the repository API
to store metadata and files.
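The division of labor between the Driver Manager and the metadata drivers can be illustrated with a small Python sketch. It does not reproduce the Kepler API: the class names, the `handle` method, and the string-based partial responses are assumptions for illustration. Only the six verb names, the `badVerb` and `cannotDisseminateFormat` error codes, and the `oai_dc` metadata prefix are taken from OAI-PMH itself; the `olac` prefix is used as an example.

```python
from typing import Dict, List

# The six request types (verbs) defined by OAI-PMH.
OAI_PMH_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


class EchoDriver:
    """Stand-in metadata driver (hypothetical) for one metadata format."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix

    def handle(self, verb: str, args: Dict[str, str]) -> str:
        """Return this driver's partial response for a request."""
        return f"{self.prefix}:{verb}"


class DriverManager:
    """Selects the drivers involved in a request and merges their answers."""

    def __init__(self, drivers: List[EchoDriver]) -> None:
        self._drivers = {driver.prefix: driver for driver in drivers}

    def dispatch(self, verb: str, args: Dict[str, str]) -> str:
        if verb not in OAI_PMH_VERBS:
            return "error:badVerb"
        prefix = args.get("metadataPrefix")
        if prefix is None:
            involved = list(self._drivers.values())  # every format answers
        elif prefix in self._drivers:
            involved = [self._drivers[prefix]]
        else:
            return "error:cannotDisseminateFormat"
        # Build the whole response from the partial responses.
        return "|".join(driver.handle(verb, args) for driver in involved)


manager = DriverManager([EchoDriver("oai_dc"), EchoDriver("olac")])
```

Under this sketch, supporting a further metadata format only requires registering one more driver object with the manager, which mirrors the extensibility argument made above.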




[Figure labels: Either Database or Local File System; periodically updated by GroupServer]

Fig. 2. Archivelet Architecture



3 Server-Side Architecture 

We discovered that many situations require a server-side solution. Examples include
cases in which the firewall is not controllable by the author, the author is
traveling to a different organization, the author’s organization has a strict
security policy, or only Internet access is available but not the personal laptop with
the archivelet. We sometimes refer to this as Internet cafe publishing. The main
advantage of the server-side archivelet is that it can be accessed from anywhere.

The Kepler server-side archivelet can be accessed from a Kepler GDL and is
governed by the validation rules specified by that GDL. The functionality as the user
sees it is identical: it supports publishing metadata in DC and OLAC formats, creates
persistent URLs for each individual archivelet, allows any service provider (e.g.,
GDLs) to harvest metadata, and allows users to import/export metadata.

The server-side archivelet system can be logically divided into three layers, as shown
in Figure 4: View (User Interface), Controller, and Model. Compared with Figure 2, the
basic, extensible driver architecture is the same, though not the implementation;
the difference lies in the user interaction modules and the interaction
with the GDL. In the server-side architecture, the archivelet resides on the GDL. The
user interface layer provides various user interfaces for authors to create an
archivelet, publish metadata, and perform the other activities defined in the main
archivelet interface (Figure 3). Compared to the archivelet interface, we needed to
make some adjustments, as we wanted to keep it strictly within HTML to make it as
widely operable as possible. The controller layer has several components (servlets and
metadata drivers) that control the sequence of transitions between user interfaces and
the flow of data between user interfaces and the database, and that handle OAI-PMH
requests. The repository layer performs the actual database operations. The key
component of this layer is the DB Repository. It can handle metadata regardless of its
native format. It provides the database operations that are required for archivelet
registration, configuration, and saving and extracting metadata.

Fig. 3. DC Publishing Tool for Server-side Archivelet



4 New Features 

Persistent URLs: Archivelets are roamers; they can be at different locations with
different IP addresses at different times. One of the metadata items in each published
record in the archivelet is the URL of the document associated with the metadata. The
document is stored on the machine where the archivelet resides. Clearly, using an
archivelet’s current IP address as part of the URL is a problem. Instead, we use
persistent URLs for archivelets, their metadata records, and their documents. In
prior implementations every archivelet had a base URL of the format:

http://machinename.or.ip[:port]/OAIRequestHandlerScript 

This base URL was distributed to service providers (e.g. Kepler group servers), so
they could issue OAI-PMH requests to harvest metadata from the archivelet. An
archivelet may be installed on a machine whose IP address changes every time it
connects to a network. In this scenario, if the change in an archivelet’s base URL is
not propagated to the service providers, they cannot contact the archivelet to harvest
metadata. A persistent URL for an archivelet is a base URL that is independent of the
archivelet’s IP address. The archivelet distributes this persistent URL to service
providers instead of its actual base URL.

Fig. 4. Server-side Archivelet Architecture

To use an archivelet for publishing metadata, it has to be registered with a Kepler 
GDL. While registering, the archivelet sends its actual base URL, along with other 
information, to the group server. On successful registration, the group server creates a 
persistent URL of the format 

http://group.server.url[:port]/path-to-OAIControlServlet/tr/archivelet-name

for example:

http://kepler.cs.odu.edu/testgroup/servlet/OAIControlServlet/tr/maly

The group server stores a mapping between an archivelet’s actual base URL and its
persistent URL. Also, it sends this persistent URL to the archivelet as a response to 
the registration request. To ensure that the persistent URL always maps to the most 
current base URL, the archivelet sends its current base URL to the group server at a 
predefined interval (e.g. on every startup) and the group server updates the persistent 
URL mapping for that archivelet. 

When an OAI-PMH request is issued to this persistent URL, the OAIControlServlet
on the appropriate Kepler GDL receives the request. The OAIControlServlet parses
this request into two parts: the persistent URL and the OAI-PMH request (verb and
parameters). Using the persistent URL, it consults its mapping table to resolve the
actual base URL of the archivelet. Once it obtains the actual base URL, it appends the
OAI-PMH request and forwards the request to the archivelet. The archivelet responds
to the OAI-PMH request, and the OAIControlServlet relays the response to the original
requester. Thus the persistent URL masks the dynamic nature of an archivelet’s base URL.
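
The registration, update, and resolution steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Kepler implementation: the class name PersistentUrlRegistry, its method names, and the in-memory dictionary standing in for the group server's mapping table are assumptions made for the example.

```python
from urllib.parse import urlsplit


class PersistentUrlRegistry:
    """Hypothetical stand-in for the group server's mapping table."""

    def __init__(self, servlet_prefix: str) -> None:
        # Prefix under which persistent URLs are created, e.g. ".../OAIControlServlet/tr"
        self.servlet_prefix = servlet_prefix.rstrip("/")
        self._base_urls: dict[str, str] = {}

    def register(self, name: str, base_url: str) -> str:
        """Store the archivelet's actual base URL and return its persistent URL."""
        self._base_urls[name] = base_url
        return f"{self.servlet_prefix}/{name}"

    def update(self, name: str, base_url: str) -> None:
        """Record a new base URL reported by a registered archivelet (e.g. on startup)."""
        if name not in self._base_urls:
            raise KeyError(f"archivelet {name!r} is not registered")
        self._base_urls[name] = base_url

    def resolve(self, request_url: str) -> str:
        """Split a request into persistent URL and OAI-PMH query, then rewrite it
        so that it targets the archivelet's current base URL."""
        parts = urlsplit(request_url)
        persistent = f"{parts.scheme}://{parts.netloc}{parts.path}"
        prefix = self.servlet_prefix + "/"
        if not persistent.startswith(prefix):
            raise ValueError("not a persistent archivelet URL")
        name = persistent[len(prefix):]
        base_url = self._base_urls[name]  # KeyError if the archivelet is unknown
        return f"{base_url}?{parts.query}" if parts.query else base_url
```

For instance, after registering an archivelet named maly with a made-up base URL such as http://128.82.7.77:2048/OAIRequestHandlerScript, resolving the persistent URL with the query verb=Identify yields that base URL with the same query appended. A real servlet would then forward the request and relay the archivelet's response, which this sketch leaves out.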

Validation Tool: One defining feature of a group within a community is its publication
process and the level of associated control. The minimal requirement that Kepler
institutes is the use of DC metadata. However, DC in itself has no mandatory fields,
and it is up to a community to define the rules on how exactly a field should be filled
and how it is to be used. This motivates us to adopt a process that can be described
and, more importantly, enforced. The Validation module serves two purposes: (1) it
provides declarations and implementations of a set of Validation APIs (Figures 2 and
4) that can be called by drivers (or any other Kepler modules) for metadata validation;
(2) it provides a set of tools that help the driver developer generate an administration
GUI for a new metadata set. The motivation for such an administration GUI is to give
the group administrator some flexibility to tailor a general metadata set to an
enforceable set of guidelines. For example, some groups may require that the creator
field of an OLAC metadata record published in the group be mandatory and begin
with a capital letter. Such constraints are not documented in the published OLAC
XML schemas. By using the administration GUI, a group administrator can impose
these constraints conveniently in her group without directly editing the OLAC schema
files. Figure 5 shows the administration GUI while tailoring an OLAC metadata set.
The general concept is to allow entities at lower levels to impose more rules than are
defined at higher levels, as long as they can be easily mapped into a specification at,
say, the OLAC level so as to adhere to OLAC’s schema.

Fig. 5. An example administration GUI for OLAC

The current implementation of the Validation module allows a group administrator 
to: (1) specify whether a field is mandatory or optional; (2) specify regular expres- 
sions for a field (for example, the date field must be in the format yyyy-mm-dd); (3) 
add/delete items in the option list for some specific fields (for example, the value of 
language field can only be chosen from “dz”, “el”, “en”, “eo”, “es”). The validation 
module is designed to work from an XML schema describing a metadata set. In this 
implementation we assume that the XML Schemas for any new metadata set always 
will follow the authoring style of DC and OLAC schemas in defining and qualifying 
elements by means of Element Refinement and Encoding Schemes. The Validation 
module depends on these authoring styles as useful heuristics when extracting initial 
configuration from the original metadata XML Schemas. The initial configuration is 
extracted, then stored in an intermediate configuration file, which maps an XML
namespace and element/type/attribute pair to its mandatory/optional setting, regular
expression, and option list documented in the original schemas. Subsequent changes
to the configuration that a group administrator makes through the administration GUI
are written into this intermediate configuration file, without polluting the original
metadata schemas.
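
The three kinds of constraints listed above (mandatory/optional, regular expression, option list) can be illustrated with a short Python sketch. It is not the Kepler Validation API: the FieldRule class, the validate function, and the dictionary used in place of the intermediate configuration file are hypothetical, and the rules shown simply restate the examples given in the text.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FieldRule:
    """One entry of a hypothetical, in-memory validation configuration."""
    mandatory: bool = False
    pattern: Optional[str] = None          # regular expression the whole value must match
    options: list[str] = field(default_factory=list)  # allowed values; empty = unrestricted


def validate(record: dict[str, str], rules: dict[str, FieldRule]) -> list[str]:
    """Return a list of human-readable violations (empty if the record is valid)."""
    errors = []
    for name, rule in rules.items():
        value = record.get(name, "")
        if not value:
            if rule.mandatory:
                errors.append(f"{name}: mandatory field is missing")
            continue  # optional and absent: nothing more to check
        if rule.pattern and not re.fullmatch(rule.pattern, value):
            errors.append(f"{name}: value {value!r} does not match {rule.pattern!r}")
        if rule.options and value not in rule.options:
            errors.append(f"{name}: value {value!r} is not in the option list")
    return errors


# Rules mirroring the examples in the text (field names are illustrative).
RULES = {
    "creator": FieldRule(mandatory=True, pattern=r"[A-Z].*"),
    "date": FieldRule(pattern=r"\d{4}-\d{2}-\d{2}"),
    "language": FieldRule(options=["dz", "el", "en", "eo", "es"]),
}
```

Calling validate on a record whose creator starts with a lowercase letter, whose date is not in yyyy-mm-dd form, or whose language is outside the option list returns one message per violated rule. Keeping the rules in a separate structure, rather than in the schema, reflects the intermediate configuration file described above.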

As part of a new metadata driver, the administration GUI needs to be implemented
by the driver developer. To facilitate this process, the Validation module introduces
an additional layer of indirection: a GUI configuration file. A GUI configuration file is
an XML instance file that controls the tree structure displayed in the left panel
of the administration GUI. Specifically, each element in this XML file maps a tree
node in the GUI to an element/type/attribute defined in the metadata schemas. By
editing this file to specify the fields that are eligible for tailoring by the group
administrator, a driver developer can use the tools provided by the Validation
module to generate the administration GUI automatically.

NAT/Proxies: Network Address Translators (NATs) and proxies raise issues specific to
the archivelet, as it runs a server that listens for OAI-PMH and file download requests.
The existence of a NAT might cause communication between the harvester and
the OAI-PMH server in the archivelet to fail. To handle NAT/proxy issues, we use
port mapping (also called port forwarding). It requires that the user configure the
NAT/proxy device or software to forward incoming communications on some port to
the archivelet software.

Import/Export: From our own internal use of archivelets, it became clear that there is
a need for archivelet information exchange. Although archivelets are harvestable from
the outside, they did not originally have the capability to receive information (OAI-PMH
does not support two-way traffic). We decided to adopt the OAI-PMH Static
Repository [5]. The static repository specifies an XML schema that contains all the
necessary information for OAI-PMH “frozen” in a single file. The use of a Static
Repository Gateway allows a harvester to harvest a static repository as if it were a
regular OAI-PMH data provider. Since in Kepler the archivelet typically contains fewer
than 100 records, it is easy to transmit all the information in the OAI-PMH Static
Repository format. This is done with the export/import feature in Kepler. It enables the
user to export the metadata in her collection to a file, which can then be shared with
others directly to avoid repeated re-entry of metadata. A typical scenario is an author
with several co-authors who shares the metadata with them, sparing them from
entering the information into their own archivelets. This does, however, raise the issue
of duplication at the group server. In this implementation we have not addressed the
duplication issue; duplicate records will simply be listed multiple times as coming
from different archivelets.
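
The idea of freezing a small collection into one XML document and reading it back can be sketched as follows. The sketch is deliberately simplified: the element names repository and record and the functions export_records and import_records are hypothetical, and the output is not the actual OAI-PMH Static Repository schema, which carries additional required sections. It also keeps a single value per field, whereas DC fields may repeat.

```python
import xml.etree.ElementTree as ET


def export_records(records: list[dict[str, str]]) -> str:
    """Serialize metadata records into one XML string (simplified, hypothetical layout)."""
    root = ET.Element("repository")
    for rec in records:
        node = ET.SubElement(root, "record")
        for name, value in rec.items():
            ET.SubElement(node, name).text = value
    return ET.tostring(root, encoding="unicode")


def import_records(xml_text: str) -> list[dict[str, str]]:
    """Parse a string produced by export_records back into a list of records."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: child.text or "" for child in rec}
        for rec in root.findall("record")
    ]
```

A co-author who receives the exported string (for example, saved to a file) can call import_records to obtain the same records without re-entering them. Detecting the resulting duplicates at the group server is outside the scope of this sketch, as it is in the implementation described above.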




Fig. 6. Results from the Simple Search 



Search Service and Caching: The group server offers a search service on the metadata
harvested from the Kepler archivelets (and other OAI-PMH data providers). The
search service is based on the Arc search engine [6, 7]. The group server offers a
simple search interface that searches all metadata fields, an advanced search interface
that allows fielded searching, and a browsing interface (any of the archivelets that the
group server harvests can be browsed) for extemporaneous resource discovery. Using
the demonstrator group server in use at the Old Dominion University Computer
Science Department, Figure 6 shows the metadata records resulting from a query. The
group server performs caching by default. Clicking on the title in a record (Figure 6)
displays all the metadata fields of that record in a separate window. The original
resource is available in the DC.Identifier field, for example:
http://128.82.7.77:2048/EDICTfinal0913.doc

However, the URL of the original resource may not always be reachable. Although
the original URL is always presented, the GDL will also preemptively cache the
resource at the GDL. Preemptively caching the resources increases their availability
and insulates the GDL from the varying accessibility of the archivelets. For example,
the above resource is also available at:
http://kepler.cs.odu.edu:8080/testgroup/cache/oai.ODU_DLPublications.EDICTfinal0913.doc.



5 Related Work

The version of Kepler described in this paper draws from a significant base of existing
OAI-PMH projects, developed both at Old Dominion and throughout the community.
In addition to the original Kepler project [2], some of the features draw from the
Arc [6, 7] and Archon [8, 9] projects. As mentioned above, the OAI-PMH Static Re- 
pository format [5] was adopted when the need for archivelet importing and exporting 
was addressed. 

In the emerging field of “institutional repository software”, many open source
entries have emerged. The eprints.org software [10] from the University of Southampton
has been widely adopted. CDSWare [11] was created at CERN and has been adopted
in other locations as well. DSpace [3] was created by Hewlett Packard and the MIT
Libraries, but has been widely adopted outside of MIT. All of these systems offer
significant features and capabilities for building large-scale digital libraries and
institutional repositories. All also use the OAI-PMH as a core technology. However,
they contrast with Kepler in that they require significant resources to establish and
maintain; the installation of these systems falls significantly beyond the 10-minute
target of a Kepler GDL. The choice of a community or institutional digital library is
an extremely important one, and we recommend surveys such as [12], [13] or [14] as
guides in determining which DL suits your needs.

The persistent URL work is similar to the Extensible Repository Resource Locators
(ERRoLs) project currently underway at OCLC [15], which was first described
as “Partial PURL Redirects” in [16]. ERRoLs, and its predecessor, describe a
way to attach persistent, “human-friendly” URLs to OAI-PMH repositories and
metadata objects inside those repositories. Our approach differs in that we focus only
on repository naming, and it is CDL/GDL-centric, as opposed to centered on a registry
at OCLC or UIUC.



6 Conclusions and Future Work 

We have done extensive use testing of both the installation process and the publishing
tasks. The testing was done by project participants and a few persons within the
participating institutions: LANL, CERN, UPenn (now University of Melbourne), USGS,
and ODU. The original user interfaces (both publication and resource discovery) are
surprisingly similar to the current interfaces. Most of the changes are behind the scenes
or are new features. All of the new features, such as the server-side archivelet,
export/import, persistent URLs, and validation, are the result of needs perceived by the
project participants. Installation packages for the group server (including the
server-side archivelet implementation) and the archivelet are available on the project
website shown in [17]. At the site, a test group server can be found that anyone can use
to experiment with creating archivelets of their own and seeing them harvested by the
test group server (as long as the owner registers with the test group server). Most
importantly, we have tested repeatedly with both experts (developers) and novices
(faculty and students not related to the project, and system people for the group server)
that the installation is indeed within the target of 10 minutes for the archivelet and
1 hour for the group server.

In the near-term future we are set to field test Kepler version 1.2 in four testbeds.
USGS has developed a public warehouse of publications
(http://infotrek.er.usgs.gov/docs/usgs_pubs/publication_warehouse_contents.html) and
wants to develop a harvesting and data provider interface using the Kepler concept. The
data provider end will connect to a new Journal on Natural Organic Materials that will
directly use the Kepler software. The harvester interface will be a Kepler group server
that will connect to the USGS’ other offices. OLAC is considering using a Kepler group
server to harvest from its current collection and vice versa. They will also experiment
with the archivelets to publish field work and use the export/import feature to deliver
delayed results from field studies where authors are away for prolonged periods. LANL
is discussing giving researchers access to archivelets and having “My Library” [18]
harvest from a Kepler group server that aggregates the individual archivelets.



Abstract. This paper argues for the necessity of digital libraries to increase access
to their holdings and have greater impact on e-learning and education by
facilitating the creation of secondary repositories. These repositories will provide
discipline/community specific metadata and applications and will allow
users to find, use, manipulate and analyze digital objects more easily. To this
end, MATRIX has developed Media Matrix 1.0 - an online, easy to use server-side
suite of tools that allows users to locate specific media and streaming media
files found in digital repositories and segment, annotate and organize this
media online. This application provides users with an environment both to work
with and personalize digital media, and also to share and discuss their findings
with a community of users. Through creating a secondary repository of usage
statistics and user-generated materials/metadata to supplement both traditional
cataloging records and discipline-specific online indexes, tools like Media Matrix
can help extend the usefulness of digital libraries without increasing costs
to the libraries.



1 Introduction 

For the purposes of preservation and increased use of their holdings, libraries, ar- 
chives and cultural organizations have been researching and developing best practices 
for digitizing their analog collections. These efforts have given users unprecedented 
access to information. Scholars have long realized, however, that “access” to
information must mean more than the ability for a user to link to computer networks.
Underlying the meaning of access in relation to digital equity and universal service is the
need for a community of users to have the ability to retrieve information “in some 
form in which it can be read, viewed, or otherwise employed constructively” 
[2] [4] [5]. Access thus implies four related conditions that go beyond the ability to link 
to a network: equity, the ability of “every citizen” and not simply technical specialists 
to use the resources; usability, the ability of users to easily locate, retrieve, use, and 
navigate resources; context, the conveyance of meaning from stored information to 
users, so that it makes sense to them; and interactivity, the capacity for users to be 
both consumers and producers of information. While access to online resources has 
steadily improved in the last decade, online archives and digital libraries still remain 
difficult to use, particularly for students and novice users [1]. 

While access to digital resources has had positive effects on both scholarly research
and teaching and learning at all levels of instruction, digital libraries must take the
next step and redefine access in ways that help users to use digital objects. To this
end, MATRIX has developed Media Matrix 1.0 - an online, easy to use server-side
suite of tools that allows users to find, segment, annotate, organize, and publish digital
media found on the Internet and in digital repositories. This application provides users
with an environment not only to work with and personalize digital media, but to share 
and discuss their findings with a community of users. Because Media Matrix stores a 
significant amount of information about the digital objects selected by users and user 
generated annotations per digital object, it both provides a corpus of data on how 
digital repositories are being used and creates materials that augment traditional cata- 
loguing records. In so doing, it forms a secondary repository that holds metadata gen- 
erated by its users, additional resources for its users, specialized searches and galler- 
ies, extended materials, and pointers to digital objects in primary repositories. Thus 
the value of the application is that it can enhance the usability, access, and interactiv- 
ity of digital libraries by facilitating the creation of secondary repositories on top of 
their collections without significantly increasing costs and time needed to prepare and 
maintain additional resources. Digital libraries can also utilize, if desired, usage statis- 
tics and user generated materials/metadata to supplement traditional cataloging re- 
cords and applications. 



1.1 Spoken Words, Digital Libraries and Users 

Even though access to digital objects has grown at an exponential rate, tangible fac- 
tors have prevented users from fully taking advantage of these resources. We have a 
long history of working with texts and are comfortable moving through texts, making 
annotations, summaries, and quotations. Beyond a host of traditional methods, we 
have at our disposal sets of tools, both freeware and commercial, which help us to 
cite, catalogue, and annotate texts (e.g., Endnote, Procite, Biblioscape). Streaming
media, however, is another matter. Scholars and students, beyond some specialized 
areas, rarely have worked in the past with media and have often preferred the tran- 
script over the original. While contemporary bibliographic tools have expanded to 
allow users to catalogue and keep notes about media, they do not allow users to mark 
specific passages and moments in multimedia, segment it, and return to specific 
places at a later time. While several initiatives and products (e.g., Annotea, SHOE 
Knowledge Annotator, NetSnippets) allow users to point to specific online materials 
or portions of online materials and add their annotations to those pointers, they do not
allow users to work with the non-textual, digital media present on those pages. Mul- 
timedia thus remains underutilized in education because the tools to manipulate the 
various formats often “frustrate would be users” and take too much cognitive effort 
and time to learn [3]. 

Over the past decade, the digital library community has tried to reduce the labor 
and expense of creating, cataloging, storing, and disseminating digital objects through 
the research and development of specific practices to facilitate each of these stages. 
Although these processes have become easier, better documented, and more auto- 
mated, creating and working with digital objects is still a very specialized endeavor 
that requires specialized hardware, software and expertise - often outside the realm 
and resources of the general user. Even with the resources to work with digital media, 
copyright restrictions and streaming technologies make it difficult for users to 
download, manipulate, and use digital objects in their own practices. 

To compound this, traditional cataloging and dissemination practices often make it
difficult for users to locate and utilize digital objects within the framework and
practices of their discipline [10]. Digital objects are typically cataloged to describe their
content (bibliographic information), composition (technical metadata), maintenance
(administrative metadata) and dissemination (rights metadata and any information for 
delivering the object via online applications). While these practices are essential for 
preserving the digital object and making it available to users, the practices also make 
it available to users in a language and guise that is often difficult to understand within 
the context of use [10]. 

While the author’s name, the title of the work, and keywords are essential for de- 
scribing and locating a digital object, this kind of information is not always the most 
utilized information when users are looking for and ascertaining the relevance of a 
digital object. K-12 teachers, for instance, often do not have specific authors or titles
in mind when searching for materials for their classes. They more frequently search in
terms of grade level, the state and national standards that form the basis of their
teaching, or broad, overarching topics (e.g., core democratic values or textbook topics)
that tend to retrieve too many search returns to make the information of value. Although
the addition of more discipline-specific information at the object level would open up
digital libraries to a larger constituency and enhance the impact and usability of
digital libraries, it would be a huge and unrealistic endeavor for digital libraries given 
the multitude of disciplines that would benefit from the addition of discipline specific 
information attached to each object. 

The keys to making better use of multimedia in education and to enhancing the use 
of multimedia for specific contexts and disciplines, are to build secondary repositories 
with resources and tools that allow users to enhance and augment materials [11], 
share their work with a community of users [14], and easily manipulate the media 
with simple and intuitive tools (or at least build interfaces that match existing, well- 
known and heavily-used applications). Users will also need portal spaces that escape 
the genre of links gateways and become flexible work environments that allow users 
to become interactive producers [8]. In short, secondary repositories are the result of 
users integrating digital objects into their research and work in ways that make sense 
to them given their backgrounds and tasks. 

Herbert Van de Sompel has proposed a successful system (OpenURL/SFX frame- 
work for context sensitive reference linking) for disaggregating reference linking 
services from e-publishing [15]. In his framework, the service of providing links be- 
tween references and across e-publishers’ digital repositories is separated from the 
services provided by the e-publishers. In so doing, the service provides “seamless 
interconnectivity between ever-increasing collection of heterogeneous resources,” 
freeing primary repositories from the difficult and expensive task of ensuring links to 
references while giving users greater access to resources and increasing the value of 
the digital object [16]. Similarly, we propose the creation of secondary repositories 
that would be responsible for handling secondary metadata, extended materials and 
resources, and interactive tools and application services. Generated by interactive, 
online tools, these resources would work to contextualize, add meaning, and provide
new ways of discovering the original digital object. Primary repositories would continue
to be responsible for preservation, management, and long-term access but would
be freed from creating time-consuming and expensive materials, resources, services,
and extended metadata for particular user groups.



2 Media Matrix

MATRIX - The Center for Humane Arts, Letters, and Social Sciences Online is a 
humanities computing research center based at Michigan State University. Over the 
last five years, MATRIX has participated in the ongoing discussion and development 
of digital library practices and has built a large-scale digital repository. This digital 
repository holds over fifteen collections that contain a diverse range of materials from 
different disciplines - from images of quilts to nineteenth-century renderings of
tumors, to recordings of indigenous practices of West African tribes-people, to the
interviews of the oral historian Studs Terkel. The variety of these materials has drawn a
diverse crowd of users who come to the sites from different disciplines and with
vastly different agendas. MATRIX’s research agenda - initially under a five-year
National Science Foundation Digital Libraries II grant (1998) to develop a National
Gallery of the Spoken Word - has focused on how best to build the infrastructure of a
spoken word repository and the best practices for digitizing and disseminating the
objects within repositories. While this research has been very successful and rewarding,
the focus of MATRIX’s research agenda has shifted - under the Spoken Word
Project funded by Digital Libraries Initiative II: Digital Libraries in the Classroom 
Program, National Science Foundation in conjunction with UK’s Joint Information 
Systems Committee - in part, toward how to best make digital objects useful to digital 
library users, especially for education and e-learning. 

The Spoken Word project focuses on helping to transform undergraduate learning 
and teaching through integrating the media resources of digital repositories into un- 
dergraduate courses in history, political science and cognate disciplines in the U.S. 
and Britain. The project takes advantage of the flexibility inherent in digital reposito- 
ries to build processes for learning that will expand how students and teachers under- 
stand knowledge, knowledge resources, and their own complementary roles in higher 
education. The project is a collaboration of Michigan State University, Northwestern 
University, the National Archives and Records Administration (NARA), Glasgow 
Caledonian University, and the BBC - Information & Archives. Project researchers 
are testing whether and with what effect the integration of digital audio resources into 
university courses achieves four major project outcomes: (1) improving student learn- 
ing and retention, (2) developing aural literacy in our students, (3) augmenting student 
competence to write on - and for - the Internet, and, (4) enhancing digital libraries 
through a focus on learning. 

Research on these areas has led to the development of an application called Media 
Matrix (version 1.0) - an online, server-side tool that helps users to find, segment,
annotate, organize, and publish streaming media found on the Internet. 



2.1 Media Matrix Tool Set and Operation 

A user begins using Media Matrix 1.0 by registering for a user account at the Media
Matrix web site. In the process, users complete a short profile that describes generally
their teaching and scholarly backgrounds. Users are then issued an account that gives
them access to the Media Matrix tools and a personal portal page for gathering,
organizing, and publishing the materials they create and gather. Users are also given the
ability to create groups and invite other users to join them. The group function
is a key element of the tool set because it allows users to share resources and
collaborate on the development of resources and projects with other members of the
group. Teachers can thus create a group for each of their classes and invite students to
join that group. This allows both the teacher and students to preview the work of and
collaborate with other members of the class easily.




Fig. 1. Media Matrix user portal page.

Media Matrix does not require any special downloads or plug-ins, a feature that 
makes the tool more accessible to teachers, students and researchers, who may be 
working in computer labs and at library work stations that often do not easily allow 
for the downloading of additional software. Focusing on maintaining a familiar work 
environment, Media Matrix works within the browser of the user, and works with the 
same media players normally used to play digital objects. Users continue to use Me- 
dia Matrix by dragging five links (favlets) provided on their portal page to the book- 
mark bar of their browser (see Figure 1). Users can then search for objects on the 
internet using their own methods and preferred tools or go directly to sites where they 
want to work with digital media. When users find a digital object that they would like 
to use or work with, they simply click the appropriate link and Media Matrix is 
launched. In the case of audio, the user would, for instance, find an audio clip on any 
site on the internet (e.g., American Memory, CNN, BBC, or ESPN). They would then 
press the “Find Audio” link saved on their bookmark bar. Media Matrix then uses 
regular expressions and string matching to isolate any audio files referenced on the 
page and loads the sound into the editor. If multiple sounds are on the page, Media 
Matrix has been designed to allow users to preview and select the sound they want to 
load into the editor. Because there are virtually any number of ways a digital object 
can be embedded and described on a web page, Media Matrix can only reliably identify 
the URIs of media found on a specific web page. 

Fig. 2. Media Matrix segmenting in online editing controller window. 

The streaming media is then loaded into the appropriate media player (Real 
Player, QuickTime, Windows Media) and embedded into the Media Matrix online 
editor (see Figure 2). Taking advantage of existing resources on users’ computers and 
working with formats supplied by repositories, Media Matrix allows users to employ 
common players to control the playback of the audio. While the audio is playing, 
Media Matrix permits users to “record” portions of the streamed clip. When the user 
finds a portion he/she believes is important, he/she simply clicks the “Start Re- 
cording” button and then the “End Recording” button to capture a segment of the 
sound. Media Matrix does not actually record the audio, but instead stores the URI of 
the clip and the time offsets for that portion of the clip selected by the user. When the 
user replays the clip those offsets and the URI of the clip are then passed back to the 
player and thus only the selected portion of the streamed audio is replayed. 
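
The link-finding step described above can be sketched roughly as follows. This is an illustration in Python rather than the PHP of the tool itself; the extension list, the pattern, and the function name `find_media_uris` are assumptions made for the example and are not taken from the Media Matrix code.

```python
import re
from urllib.parse import urljoin

# Hypothetical list of audio extensions; the paper does not enumerate them.
AUDIO_EXTENSIONS = ("rm", "ram", "ra", "mov", "wma", "asf", "mp3")

# Match quoted href/src attribute values that end in one of the extensions.
MEDIA_PATTERN = re.compile(
    r"""(?:href|src)\s*=\s*["']([^"']+\.(?:%s))["']""" % "|".join(AUDIO_EXTENSIONS),
    re.IGNORECASE,
)


def find_media_uris(page_url, html):
    """Return absolute URIs of audio files referenced in html, in page order."""
    found = []
    for match in MEDIA_PATTERN.finditer(html):
        uri = urljoin(page_url, match.group(1))  # resolve relative links
        if uri not in found:                     # skip repeated references
            found.append(uri)
    return found
```

When the returned list holds more than one URI, the user would be asked to preview and choose among them. A pattern this simple misses media embedded through scripts or playlists, which is consistent with the limitation noted above.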

After segmenting the sound file and isolating the portion(s) of the streaming clip 
users want, users can then add their own thoughts or analysis to the clip in the form of 
annotations. The user then titles the clip/annotation and submits it to his/her personal 
portal page. The annotations can then be easily saved, accessed, combined, exported, 
organized, edited, shared, and published. 
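
A minimal sketch of the kind of record this workflow implies is given below. The class name `Segment`, its field names, and the use of seconds for offsets are assumptions for illustration; the paper states only that a URI, time offsets, a title, and an annotation are stored.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A pointer to part of a streamed clip; no media data is copied."""
    uri: str              # location of the clip at the source repository
    start: float          # offset (assumed seconds) where the selection begins
    end: float            # offset (assumed seconds) where the selection ends
    title: str = ""
    annotation: str = ""

    def __post_init__(self):
        if not 0 <= self.start < self.end:
            raise ValueError("offsets must satisfy 0 <= start < end")

    def duration(self):
        return self.end - self.start

    def playback_request(self):
        """Values that would be handed back to the media player on replay."""
        return {"uri": self.uri, "start": self.start, "end": self.end}
```

Because the record holds only a pointer and two offsets, replaying a segment means asking the player to open the original stream and play between `start` and `end`.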

Using the same basic steps, Media Matrix works with other media types. Users can 
work with video in much the same way as audio, allowing students and researchers to 
isolate portions of the video and add an annotation to the sections of the clip that they 
have selected. Media Matrix also allows users to bookmark whole web pages or pages 
of text or copy portions of the text into their portal as well as describe that text 
through the use of a title, annotation, and keywords. Similarly, images can be cropped 
and resized by the user as well as annotated. This can be particularly effective for 
students and researchers who need to fit images into a presentation or would like to 
demonstrate specific nuances and details about portions of images or artwork. 

In the case of each media type, Media Matrix works much like standard note- 
taking and bibliographic tools but gives users greater control over manipulating the 
media and maintains users’ work in an online (and optionally collaborative) 
environment. Once users find, segment, and annotate streaming media, they can then 
organize those entries on their personal portal page. The portal page allows users to create 
trees of meaning and organization through the use of nested folders. Users have the 
ability to display the contents of each of their folders to particular groups they have 
joined, to the general public by removing any access restrictions, or maintain the 
resources for personal use only. Once they have organized the media that they have 
collected, they can integrate that media into multimedia publications. 
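
The nested folders and their three access levels might be modelled along the following lines. This is a sketch under stated assumptions: the labels "private", "group", and "public", the `Folder` class, and the choice to evaluate each folder's setting independently of its parent are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Folder:
    name: str
    visibility: str = "private"               # assumed labels: private, group, public
    groups: set = field(default_factory=set)  # groups allowed when visibility == "group"
    entries: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def can_view(self, is_owner, viewer_groups=frozenset()):
        if is_owner or self.visibility == "public":
            return True
        return self.visibility == "group" and bool(self.groups & set(viewer_groups))


def visible_entries(folder, is_owner=False, viewer_groups=frozenset()):
    """Collect entries from every folder in the tree that the viewer may see."""
    found = []
    if folder.can_view(is_owner, viewer_groups):
        found.extend(folder.entries)
    for child in folder.children:
        found.extend(visible_entries(child, is_owner, viewer_groups))
    return found
```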



2.2 Media Matrix Delivery Presentation Layer 

Users can choose from a number of presentation templates that allow them to select 
digital objects from their portal page (audio and video segments, images and image 
selections, text, and annotations), add text and analysis, and submit for publication. 
This creates a web page presentation with a persistent URI that features the writing of 
the user and the digital clips he/she has selected. This is an especially important fea- 
ture requested by instructors because it allows teachers and students both to make 
presentations in the classroom and to create multi-media essays for submission. 

The suite of tools developed for Media Matrix not only allows users to work with 
and analyze digital objects, but also affords users the ability to locate new digital 
objects. Media Matrix uses the information users submit when creating their profiles 
and groups to create browse-able and searchable access points to portal pages created 
by other users. Historians, for example, can browse the portals of other historians 
working specifically in their research areas. K-12 teachers can browse grade-appropriate 
sections defined by specific grade levels and subjects to see what digital objects 
other teachers are using or, more important for time-challenged teachers, find 
specific presentations created around standard topics and curriculum frameworks. 

Users can also perform keyword searches over the annotations created by all users 
or specific groups of users. A teacher, for instance, can choose to search through only 
the information in eleventh-grade Civics groups in hopes of finding information that 
speaks directly to his/her needs. Because users have gathered content from across the 
Internet and from a variety of digital repositories, searching Media Matrix is equiva- 
lent to searching multiple repositories at once. Once users find an object from a par- 
ticular digital library, they can jump to that repository to find what other objects are 
available. 
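
A keyword search restricted to particular groups could look roughly like this. The dictionary keys (`title`, `annotation`, `groups`) and the rule that every keyword must appear are assumptions; the paper does not describe the matching logic.

```python
def search_annotations(entries, keywords, groups=None):
    """Return entries whose title or annotation contains every keyword.

    Matching is case-insensitive. If groups is given, only entries shared
    with at least one of those groups are searched.
    """
    wanted = [k.lower() for k in keywords]
    results = []
    for entry in entries:
        if groups is not None and not set(entry["groups"]) & set(groups):
            continue  # entry is not shared with any requested group
        text = (entry["title"] + " " + entry["annotation"]).lower()
        if all(k in text for k in wanted):
            results.append(entry)
    return results
```

Each result would carry the URI of the original object, which is what lets a user jump from a hit to the repository that holds it.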



2.3 Media Matrix and Metadata Augmentation 

One way to alleviate the high costs of augmenting metadata is to create a distributed 
model of augmentation. Not unlike the model for the development of open source 
software, digital libraries can rely on communities of users to develop the accessibil- 
ity and usability of their collections in a secondary repository. Along these lines, Me- 
dia Matrix allows users to create rich sets of discipline specific user generated meta- 
data. The segments, presentations, and annotations created for each digital object can 
serve to augment the original finding aid for the digital object. It also supplies infor- 
mation not only about the popularity and relevancy of specific resources, but also 
about most used segments and the content of files. 

Collections can also benefit by defining communities of users. For example, with 
the recent release of secret White House tapes [7], the sheer number of tapes and 
hours makes adequate cataloging of content impossible, and it is difficult to 
determine the context and people involved (or even what is said, given the poor 
quality of many tapes). Those historians and scholars (a more regulated and highly 
defined set of experts) allowed access to the collections could use Media Matrix to 
supply information about content and context as well as set terms for debates over 
more questionable areas of interpretation (e.g., when sound quality makes passages 
inaudible). While metadata gathered in these ways would need to be qualified (main- 
tained in a secondary repository) because of lack of quality control, the processes 
could make large quantities of sound more available and usable (as well as searchable 
since annotations will be keyed to particular time offsets). 

Because users can search directly using the Media Matrix environment, MATRIX 
also plans to give any digital library and online sound collection open access to its 
logs (those logs that apply to the specific collection). A digital library can export from 
Media Matrix any usage statistics and information about the specific digital objects 
users are accessing in their collections. This information can tell digital libraries 
who is accessing their holdings, which objects are being accessed, and which 
portions of those objects interest users most. MATRIX is 
planning on using this information to build for users dynamic recommendation lists 
based on other users’ preferences (as, e.g., Amazon does). In doing so, we can search Media 
Matrix and find users who have annotated specific objects and then suggest other 
segments and objects in that content folder of their portal page, a service that would 
greatly enhance the usability while helping to augment context for digital objects. 
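
One way such a recommendation list might be computed is sketched below. The input shape (one list of object URIs per portal folder) and the ranking by co-occurrence count are assumptions for illustration, not a description of the planned service.

```python
from collections import Counter


def recommend(object_uri, folders, limit=5):
    """Suggest objects that share a portal folder with object_uri.

    folders is an iterable of lists of object URIs, one list per folder.
    Objects that co-occur in more folders are ranked first.
    """
    counts = Counter()
    for contents in folders:
        if object_uri in contents:
            counts.update(u for u in set(contents) if u != object_uri)
    # Sort by count (descending), then by URI so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [uri for uri, _ in ranked[:limit]]
```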



2.4 Media Matrix Programming Environment 

Media Matrix is a PHP-based server-side application that stores information in a 
MySQL database and exports that information into XML for display. The development 
of the tool and programming environment has been designed to keep it library 
and archive independent so that it can work with almost any site on the internet. It can 
also work easily with any of the standard courseware packages. The tool is also search 
independent because it relies on traditional internet search tools and a site’s discovery 
tools to find an object. Once objects are found, Media Matrix is deployed by the user. 
Because Media Matrix does not actually copy the digital object from the site (it only 
stores a pointer to the object in the form of a URI and whatever time offsets are 
created by the user), it avoids some of the copyright and fair use pitfalls that often keep 
users from working with digital objects (although there are issues of deep linking to 
be addressed). 
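
The export of a stored pointer to XML might look roughly like the following. The element and attribute names (`segment`, `uri`, `title`, `annotation`, `start`, `end`) are hypothetical; the paper does not give the schema, and the sketch is in Python rather than the PHP of the actual application.

```python
import xml.etree.ElementTree as ET


def segment_to_xml(uri, start, end, title, annotation):
    """Serialize a stored pointer (URI plus time offsets) as an XML string."""
    seg = ET.Element("segment", {"start": str(start), "end": str(end)})
    ET.SubElement(seg, "uri").text = uri
    ET.SubElement(seg, "title").text = title
    ET.SubElement(seg, "annotation").text = annotation
    return ET.tostring(seg, encoding="unicode")
```

Only the pointer and the user's own text are serialized; the media itself stays at the source repository.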

The continued development of Media Matrix faces several challenges. Media play- 
ers and browsers are central to the development of the application. Media Matrix, as 
noted, uses both browsers as its native environment and media players to stream me- 
dia because users are comfortable with these environments and it does not necessitate 
further software installation. The most popular media players - Real Player, QuickTime, 
and Windows Media Player - all have APIs that allow information to be 
passed between the player and the browser. Because of this, clip information such as 
time parameters can be grabbed from the player to record time offsets or passed back 
to the player to play only the portion of a clip. The most common method of doing 
this is with JavaScript. Although this is an adequate development platform given 
some OS environments and browsers, it is highly problematic in others. 

Players have a troubled history of playing clips formatted for other players (al- 
though several claim to allow trouble-free playing of a number of formats). Real Au- 
dio does not play QuickTime files, for instance. Because of this, code must be written 
to identify and interpret the kinds of files that are being edited by Media Matrix and 
then separate code must be written for each player to pass information back and forth 
to the browser. Although JavaScript is a popular, standardized scripting language, not 
all browsers interpret it in the same manner, particularly event functions. Because of 
this, the full functionality of Media Matrix remains limited to Internet Explorer 6.0 
and PCs. It remains functional in Netscape and other browser environments but cur- 
rently has limited functionality on Mac OS X. Further research and development 
should allow us to expand its use and functionality in terms of players, browsers and 
platforms. MATRIX has considered employing Macromedia’s Flash as an environ- 
ment for Media Matrix and will continue to do so, but Flash also has limitations in 
working with particular players and browsers, especially Real Audio in its native 
environment. 
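
The first half of that task, deciding which player a given file needs, can be illustrated with a simple lookup on the file extension. The mapping below is an assumption made for the example; a real deployment would likely need a fuller table and MIME-type checks, and the per-player scripting code is not shown.

```python
import os
from urllib.parse import urlparse

# Assumed extension-to-player table; not taken from the Media Matrix source.
PLAYER_BY_EXTENSION = {
    ".rm": "RealPlayer", ".ram": "RealPlayer", ".ra": "RealPlayer",
    ".mov": "QuickTime", ".qt": "QuickTime",
    ".wma": "Windows Media Player", ".wmv": "Windows Media Player",
    ".asf": "Windows Media Player",
}


def choose_player(uri):
    """Return the player name for a media URI, or None if unrecognized."""
    path = urlparse(uri).path                      # drop query string and fragment
    extension = os.path.splitext(path)[1].lower()  # e.g. ".RM" becomes ".rm"
    return PLAYER_BY_EXTENSION.get(extension)
```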



2.5 Media Matrix Beta Testing 

Steve Cohen of Tufts University completed an initial assessment of using digital li- 
braries and Media Matrix in a classroom setting. For the assessment, a survey history 
course of 150 students at Michigan State University, History 203, which covered 
twentieth-century American history, was chosen. A sample of 40 students in three 
different sections of the course was given written surveys to complete and of these, 10 
students were interviewed in more detail. No controls or quasi-experimental protocols 
were used, so results can only be considered as trends and will be used to design more 
comprehensive assessments for the Fall Semester of 2004 and to do initial usability 
revisions of Media Matrix. 

In sum, students were positive about their experience with both digital libraries and 
Media Matrix. Most students visited between 2 and 5 digital libraries to review and 
acquire sources for their essay assignment and spent totals of 4-8 hours on task 
(low=2, high=20). Most reported that, based on their experience in History 203, they 
would be more motivated to enroll in a course that used digital libraries and Media 
Matrix, and most reported that the images and audio helped to improve their work; 
the material “came alive.” 

2.5.1 Digital Libraries Usability. Overall, students would like more digital library 
resources given to them that work for the assignment and work with Media Matrix. 
They would like the search capabilities of digital libraries improved with more ad- 
vanced features, annotations, thumbnails, topic searches, and easier access. Often 
students were not clear about what they meant by easier access and better navigation 
and searches, but they found digital libraries hard to use because it was difficult for 
them to sort through the resources or find resources that fit their topics. Often they 
noted that they would like materials to be better annotated and organized by topics. 
As Cohen noted, the problems with digital library interfaces were “not an issue of 
technical skill but rather design and informatics. During the group interview students 
suggested that the DLs did not have good interfaces for browsing, but seemed to be 
designed for users who already knew what they were looking for.” 

2.5.2 Media Matrix. The suggestions for Media Matrix fell into three main catego- 
ries: 

1) Improved instructions: the instructions need to be written for those not familiar 
with technology, step by step, and placed at point of need. 

2) Improved presentation layer: students would like to be able to have more format 
controls over text (“to work like a word processor”) and do in-text citations; they 
wanted to create a more professional looking essay. 

3) Easier to use: what was meant by easier to use was more difficult to define but for 
the most part, students wanted Media Matrix to work with more kinds of file for- 
mats, digital library sites, and operating systems; and they wanted more informa- 
tion on the pop-up window that listed resources found on a web page (if more than 
one source was found on the page) to identify the resource. 

Other significant comments focused on creating a FAQ; starting with a smaller as- 
signment (students were asked to write a 2500 word essay using text, images, and 
sound) or shorter sequenced assignments; students would also like to add resources to 
their resource tree without leaving the presentation layer. 

Based on the surveys and interviews, in addition to improving the usability of digi- 
tal libraries and Media Matrix, Cohen has initially concluded that students need more 
help with understanding the value and use of sources that they do find. One way we 
will attempt to do this will be to increase the number of fields that students will be 
required to complete for an annotation, allowing students, for example, to note who, 
when, where and the motives of the creator(s)/participant(s). Students should also be 
given space and prompts to evaluate the source and its context. These and several of 
the other above suggested improvements are being incorporated into the tool set. 



3 Conclusions 

Digitizing collections and putting them online provides new and unprecedented ac- 
cess to information, but to have an impact on e-learning and education, it is no longer 
enough for digital libraries to stop at search and browse. Digital libraries must take 
the next step and help users employ those digital objects in ways that make sense to 
specific user communities and academic disciplines. Whether it is the addition of GIS 
information and the creation of GIS aware applications that let historians view objects 
in time and space, or search interfaces that take into account the pedagogical 
environment of educators and help them find resources to utilize within specific 
assignments, or applications that bring content alive for students through 3D rendering and 
role play, digital libraries must work closely with users to discover the unique 
perspectives they bring to the site and build applications that bring digital objects alive 
within those worlds. 

Given the present budget crises and the costs and time associated with digitizing 
materials and managing digital repositories, it is often not feasible for digital libraries 
to offer extended services. Creating secondary repositories that can make use of a 
number of collections and focus on the needs of particular user groups (especially for 
e-learning) makes more sense for users and digital librarians. Correctly deployed 
secondary repositories, created from user generated data with specific applications, 
can increase visibility and accessibility of existing collections and thereby help digital 
libraries and archives to cultivate the full meaning of access: equity, usability, con- 
text, and interactivity. 

Although development of applications to work with more browsers and players is 
necessary, Media Matrix 1.0 has proven highly successful in its first run. It has been a 
marriage that has proven fruitful for both users and digital libraries. Users have a way 
to utilize and personalize digital objects and digital libraries have access to a wealth 
of information that can be tied to the digital object. Creating Media Matrix has helped 
us to redefine the term access and to imagine a more flexible and interactive work 
space for scholars and students. This is particularly crucial when it comes to sound 
archives: if we do not enhance the value of sound for users and increase the demand, 
then our sound archives will continue to languish in neglect and decay. 



Abstract. We have produced a system that automatically incorporates syndicated 
materials from sources including library acquisition records and online 
news sites to form growing hypertextual structures. This system enables users 
to create personal and shared collections built atop a growing substrate. It also 
seeks to empower users through the use of information filters to create dynamic 
personal collections that can themselves grow over time to include materials as 
they appear within the underlying collection. In addition, we are investigating 
particular benefits of intersecting hypertextual paths as a useful structure for 
representing such sub-collections and the resources extracted from the feeds 
themselves. We present our prototype system, the emerging standards for syn- 
dicating online content, and a discussion of the importance of supporting 
growth within digital libraries generally. 



1 Introduction 

Although often useful, the tendency to view digital collections as static is seldom true 
to the model of physical collections. Libraries, for instance, continuously acquire new 
materials and such growth over time provides a better model for both digital and 
physical collections [18]. Users and maintainers of digital collections require tools 
built to anticipate and accommodate patterns of continuous growth. As users increas- 
ingly rely upon expanding collections of electronic resources they require advanced 
filtering techniques to easily locate materials suited to their needs. We have created a 
prototype system to investigate methods of assisting users to select, manage, and 
share materials gathered from growing collections. 

Studies of patrons’ work practices in libraries have emphasized the tendency of 
users of physical libraries to create personal sub-collections that are often shared. 
Examples include students’ annotated texts or a knowledge worker’s notes becoming 
reference materials for her colleagues [2, 19, 20]. Furthermore, individuals benefit 
when their personal collections are interrelated, tying together resources (e.g., memos 
and email messages) that are generally separated in today’s computing environments 
[9]. Motivated by these observations, our system enables users to create dynamic 
personal collections that intersect with other users’ collections into a richly associa- 
tive linked lattice of meta-documents. We have previously investigated its use with 
networked news feeds from the Web [8] and in this paper concentrate upon its use 
with bibliographic information produced by a physical library. 

Our prototype system relies upon the recent emergence of standardized digital 
formats for the syndication of materials commonly referred to as RSS feeds (“Really 
Simple Syndication” or “RDF Site Summary” [32, 33]). These formats provide an 
XML-based mechanism for encapsulating metadata in a standardized and harvestable 
manner. The prototype builds upon our earlier work with the Walden’s Paths system 
[25] to enable users to create, maintain, and share Web-based materials. We have 
adapted our Walden’s Paths tools to automatically harvest information from syndi- 
cated data feeds and to support building personal collections atop this growing set of 
resources. In cooperation with the Texas A&M University System’s libraries we are 
incorporating metadata about all new library acquisitions into this system. We have 
also been harvesting news feeds from 30 different news-related Web sites and lexi- 
cally analyzing the articles appearing within them. 

This paper proceeds as follows: in the next section we discuss related work, the 
evolution of the RSS standard for syndicating information online, and our Walden’s 
Paths system. The subsequent section provides a high-level discussion of our ap- 
proach to supporting personal collections built atop growing information substrates 
and scenarios of use. Section 4 describes the architecture of our prototype and is fol- 
lowed by a discussion of our work to date with the system. We conclude in section 6 
with the directions in which we aim to extend this work in the future. 



2 Related Work 

Many researchers have explored how individuals seek, organize, and share informa- 
tion from physical collections. While the environments of digital and physical collec- 
tions differ, this work has largely sought to identify processes intrinsic to information 
seeking. Across the various studies users were found to employ idiosyncratic and 
individual methods but also to demonstrate a basic propensity for using personal notes 
and reference lists in information seeking. Such lists were useful not only to their 
authors but provided a vantage into an individual’s needs, activity, or interests. 

O'Hara, et al., [20] studied the work practices of graduate students performing re- 
search in libraries. The students relied upon hand-written notes, primarily biblio- 
graphic references, to assist them in returning to found information and for future 
reference. Bishop’s study [2] of researchers’ work and communication practices ex- 
amined their use of structured documents, particularly scientific journal articles. Most 
interviewees depended upon personal procedures in their work practice, but all shared 
a tendency to craft transitional documents and to use annotation to filter documents. 
Crabtree, et al., [5] performed an ethnographic study of library help desks. Library 
patrons were found to often be incapable of expressing their needs clearly. When the 
patrons brought lists of materials with them, staffers could use those lists to establish 
the patrons’ context of investigation. Kracker and Pollio [17] apply content analysis 
and phenomenological interpretation to undergraduate students’ rhetorical descrip- 
tions of their experience of libraries. They find that library patrons are often over- 
whelmed by the sheer scope of available materials in their first encounter with re- 
search libraries. Patrons often adapted by locating and subsequently centering them- 
selves within specific locales that had been previously perceived as containing rele- 
vant materials. 

Marshall [19] provides an ethnographic study of a university library’s use of metadata 
to describe a collection of physical artifacts and digital materials. Her research 
helps outline the dimensions of the problem space that collection maintainers face in 
creating metadata for collections comprised of heterogeneous materials. In her analy- 
sis of a study of the cataloging codes and conventions used in U.S. libraries, Tillett 
[28] establishes 7 classes of linkage appearing between bibliographic items. The spec- 
trum of categorizations she uses encompasses all possible relationships between simi- 
lar materials in a collection. She explores the current practice, at the time, of librarians 
in recording relationships between materials and her findings emphasize the impor- 
tance of accommodating variable linkages between catalogued materials. 



2.1 Personal and Shared Digital Libraries 

The Berkeley Digital Library’s Personal Libraries system [30] enables users to create 
collections built from materials extracted from a document collection. UpLib [16] 
provides a thumbnail visualization interface for searching and browsing the docu- 
ments and images that comprise one’s daily life, such as receipts, articles, and photos. 
Salticus [3] develops a predictive model of a user’s interests during her navigation of 
the Web based upon structural features of pages and the user’s actions. 

Geisler, et al., [13] propose bringing the concept of special collections from physi- 
cal libraries into digital libraries. They discuss the benefits accruing to both users and 
collections from enabling “virtual collections”. MiBiblio [10, 23] provides users with 
personal spaces in which to store information found while navigating digital library 
collections. Materials can be placed into public categories for inclusion into others’ 
spaces according to personal characteristics - thus enabling professors, for instance, 
to place items onto a virtual reserve for students in their classes. Their system also 
seeks to provide transparent access to multiple digital collections through 
standardized information federation protocols. Robertson, et al., [24] describe a Web-based 
interface for a corporate research library system. Reference librarians shared the re- 
sults of their information gathering sessions and questioners could view their earlier 
interactions. Patrons and librarians were able to comment upon, upload, download, 
and email the results of their earlier activity. 



2.2 Information Filtering 

In contrast, Foltz and Dumais note, in their examination of systems for information 
filtering, that filtering is hardly a novel concept [12]. People perform 
filtering when subscribing to particular magazines or watching certain television 
channels. Belkin and Croft explain the difference between information filtering and 
information retrieval [1]. Distinguishing characteristics of filtering systems are that 
they deal with unstructured or semi-structured data, particularly data that appears in 
continuous streams, and that they tend to focus upon users’ long-term or repeated 
interests. 

Personal collection systems of the sorts described in the preceding section have 
generally been oriented toward the extraction of materials from within static collec- 
tions. The systems generally classified as information filtering systems work upon 
streams of information, an approach better suited for use within continuously growing 
collections. Several researchers have applied filtering techniques in seeking to help 
users to manage various computer-based information forms that generally undergo 
continuous growth or change. 
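
The difference between extracting items from a static collection and filtering a stream can be illustrated with a standing filter that is offered each new item as it arrives. The class name `DynamicCollection`, the item fields, and the simple any-keyword rule are assumptions for this sketch and do not describe any of the cited systems.

```python
class DynamicCollection:
    """A collection defined by a standing filter rather than a fixed item list."""

    def __init__(self, name, keywords):
        self.name = name
        self.keywords = [k.lower() for k in keywords]
        self.items = []

    def matches(self, item):
        text = (item.get("title", "") + " " + item.get("description", "")).lower()
        return any(k in text for k in self.keywords)

    def offer(self, item):
        """Called for each newly arrived item; keeps it only if the filter matches."""
        if self.matches(item) and item not in self.items:
            self.items.append(item)
            return True
        return False
```

Because the filter persists, the collection keeps growing as the underlying stream grows, without the user repeating a query.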

Gifford’s Semantic File System (SFS) [14] and Gopal and Manber’s HAC (“Hierarchy 
and Content”) systems [15] provide dynamic access to file system objects based 
upon “virtual” files and folders encapsulating specific selection criteria. Other sys- 
tems have applied filtering techniques to computer mediated communication such as 
Usenet and electronic mail, including INFOSCOPE [11], Phoaks [27], and Informa- 
tion Lens and GroupLens [22]. Largely as a result of the dearth of widely adopted 
standards for describing Web-based materials, such as Fedora [31], few systems have 
applied filtering techniques to the Web. 



2.3 Syndicated Feeds and the RSS Standard 

In 1999 the Netscape Corporation created a portal site providing customized news 
feeds to its users. To facilitate extracting and presenting news from different sources 
they used an XML-based format called RSS (RDF Site Summary) [33] designed to 
encapsulate resource description framework (RDF) metadata [32]. The format was 
kept intentionally simple and lightweight, initially containing little more than a head- 
line and URL to some Web-based material. This ease-of-use gradually led to its adop- 
tion by other sites as a method for summarizing newly available materials. 

Netscape ultimately lost interest in both their portal endeavor and the RSS format, 
but work upon the format continued. In subsequent years the initial (version 0.9) for- 
mat would evolve into a set of related standards for presenting syndicated metadata 
online. Although the majority of these standards retain the name RSS (versions span 
from the initial 0.9 through 2.0) a newer variant is called ATOM [34]. The authors of 
these later standards retained the inclusion of RDF metadata and subsequently incor- 
porated the Dublin Core Metadata Initiative’s elements [35]. Although originally 
developed to encapsulate news articles, with the emergence of the Web log (or blog) 
phenomenon syndicated feeds have begun to gain far more widespread popularity and 
usage. 
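
To make the format concrete, the sketch below pulls headline and URL pairs out of a simple feed using Python's standard library. It assumes a non-namespaced feed in the style of RSS 0.91 or 2.0; RDF-based RSS 1.0 and ATOM feeds use XML namespaces and would need namespace-aware lookups. The function name `parse_rss_items` is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(feed_xml):
    """Extract (title, link) pairs from a simple non-namespaced RSS document."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        items.append((title, link))
    return items
```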



2.4 Hypertextual Paths and Walden’s Paths 

Hypertextual paths find their first mention in the well known paper by Vannevar 
Bush, “As We May Think.” [4] Bush proposed hypertext paths as a means for associ- 
ating conceptually related but physically separated items within an information space. 
The Walden’s Paths system [25] is a suite of tools that supports the creation, presenta- 
tion, and maintenance of hypertextual paths inspired by such associative trails of 
found knowledge. Our work has largely focused upon the use of paths built from 
Web-based materials in educational environments [26]. In our system individual paths 
contain pages from the Web that all share some common topic, supplemented by 
author-provided annotations, to form coherent narrative presentations. 

Paths provide readers with both local and global contexts of their constituent in- 
formation elements. Each item in a path is implicitly related to every other item by 
merit of inclusion within the path and the ordering of elements may be used to express 
some narrative purpose. Web syndication feeds similarly present ordered lists of 
materials and are themselves a form of hypertextual path structure. Paths derived from 
these Web feeds provide an ideal mechanism for describing hypertextual paths that 
continuously grow over time. 

The items selected by human path authors are chosen for their explicit relevance to 
some context. With Web feeds the common context derives from their common point 
of origin, although more complex associations can also exist. Pages appearing in the 
syndicated feed for the computer news site Slashdot.com generally deal with
computers, and those appearing in the feeds from ABCNews.com’s entertainment and
politics sites will deal with those respective topics. The elements of syndication feeds 
are ordered chronologically, meaning that individual items are unlikely to be directly 
related to their immediate neighbors other than by happenstance. 

Elements of hypertextual paths can also provide points of intersection between the 
different paths containing them, providing jumping off points from one narrative or 
contextual thread to another [7]. Recently we have begun investigating whether im- 
plied intersections between materials within paths can assist with knowledge man- 
agement and information seeking. We are harvesting RSS feeds from thirty Web sites 
and applying semantic similarity techniques to the linked pages to deduce the general 
topic of individual pages and automatically generate implicit virtual paths between 
related pages [8]. In this way, hypertextual paths evolve the information space from 
one comprised of individual, isolated, collections to one that embodies an information 
collective. 



3 Approach 

In addition to harvesting articles from online news feeds we are investigating uses of 
library bibliographic records. We have begun to import a syndicated feed containing 
materials newly acquired by the Texas A&M University Libraries every day. These 
libraries, along with over 250 other libraries in 7 countries, currently use Michael 
Doran’s New Books software [38] to provide an OPAC interface to new acquisitions.
We adapted the data files used with the New Books software to generate a daily RSS 
feed; we are investigating the use of interconnected paths as a flexible interface to 
such information. While the number varies daily, our libraries acquire, on average,
over a thousand new items every week; consequently the RSS feed represents a re- 
source that is both quite sizable (perhaps too sizable for consumption via available 
RSS browsers) but also one that is of immediate value to scholars on our campus. 

We have modified our Walden’s Paths system to automatically retrieve and incor- 
porate materials from online syndicated feeds. For Web-based feeds, including those 
from news feeds, we perform semantic analysis and key phrase extraction over the 
full resource as a means of supplementing the metadata provided by the feed itself [8]. 
The metadata associated with library records is richer than that provided with Web 
feeds, but the content linked to library records is not as informative as that associated 
with Web feeds. Consequently, for the library feeds, we focus our processing on 
ingesting the metadata of interest. Once processed, library records are collected into a 
path that is otherwise treated identically to any other path. We are interested in the 
potential for such a system to allow users to select elements from among the various 
feeds, particularly through the specification of dynamic information filters. Such an 
approach would enable materials not yet available at the time of collection creation to
be automatically included into personal collections as they arrive, for long-term
collection growth.

Consider a researcher looking for materials from her local research library. She be- 
gins by applying filters to the feed of library acquisitions, requesting only materials 
with the Library of Congress classification for Computer Science (QA76) and the 
subject heading “Wireless Communications.” She further limits the selection to in- 
clude only materials that arrived after a specific date or within a date range. To ensure 
that she does not retrieve the same materials over and over, she might also constrain 
the display to include items never previously viewed. This set of criteria can be saved 
as a dynamic path which might, for instance, stay empty until some relevant resources 
are acquired by the library or a pertinent event appears in the news feeds. 
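
The researcher's saved criteria can be sketched as a small predicate object that is re-evaluated whenever the path is opened. This is only an illustration under stated assumptions: the `Item` and `DynamicPath` names, their fields, and the sample records are hypothetical and are not taken from the Walden's Paths implementation.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Iterable, List, Optional, Set


@dataclass(frozen=True)
class Item:
    """One path element; field names are illustrative, not the system's schema."""
    url: str
    title: str
    added: date
    lc_class: Optional[str] = None          # None for items from Web feeds
    subjects: FrozenSet[str] = frozenset()


@dataclass
class DynamicPath:
    """A saved set of constraints that is re-evaluated on every access."""
    lc_class: Optional[str] = None
    subject: Optional[str] = None
    added_after: Optional[date] = None
    unseen_only: bool = False

    def matches(self, item: Item, seen: Set[str]) -> bool:
        if self.lc_class is not None and item.lc_class != self.lc_class:
            return False
        if self.subject is not None:
            wanted = self.subject.lower()
            if wanted not in {s.lower() for s in item.subjects}:
                return False
        if self.added_after is not None and item.added <= self.added_after:
            return False
        if self.unseen_only and item.url in seen:
            return False
        return True

    def resolve(self, items: Iterable[Item], seen: Set[str]) -> List[Item]:
        # The path holds no items itself; it may be empty until matches arrive.
        return [i for i in items if self.matches(i, seen)]


# Hypothetical sample data mirroring the researcher scenario.
path = DynamicPath(lc_class="QA76", subject="Wireless Communications",
                   added_after=date(2004, 3, 15), unseen_only=True)
feed = [
    Item("opac/1", "Ad hoc networks", date(2004, 3, 20), "QA76",
         frozenset({"Wireless Communications"})),
    Item("opac/2", "Compilers", date(2004, 3, 21), "QA76",
         frozenset({"Programming Languages"})),
]
print([i.title for i in path.resolve(feed, seen=set())])
```

Because the constraints are stored rather than the matching items, the same path yields different contents as the feed grows or as the reader marks items as seen.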

Consider next a professor who might select individual elements from the library 
path, or even the previously described researcher’s path, for inclusion into one of her 
personal paths. She can then insert items from the ACM digital library [36] perhaps 
along with documents such as a Web page describing her course’s grading require- 
ments or assignments. In this fashion, her path forms the reading list or syllabus for a 
seminar she teaches. Her students might create sub-paths interconnected to their class’ 
path to point to their individual summaries or assignments for each reading. 

In this fashion we hope to exploit the attractive combination of interconnected hy- 
pertextual paths and information filtering to facilitate sharing and organizing informa- 
tion within growing collections. Personal, dynamic paths can be used to passively 
collect information over time as with filtered news feeds that grow to include only 
articles meeting some criteria. The intermixing of references between physical mate- 
rials and purely digital information may also produce an opportunity to study the 
differences in individuals’ uses of each. 



4 Architecture 

Our Walden’s Paths system has been modified to automatically incorporate syndica- 
tion feeds and to allow filtered navigation of paths. The prototype interface works 
within most recent Web browsers and operates in two modes: the Path Server and the 
Path Publisher. The former provides a browsing interface for paths: users may view 
lists of publicly available paths, including those shared by other users in the system, 
and specify filters to their browsing. Users can, optionally, log into the system to 
enable additional functions, such as filtering based upon whether materials have pre- 
viously been seen, saving elements into personal paths, and saving filter criteria into 
personal dynamic paths. In the Path Publisher perspective logged-in users see an 
overview of all of their personal paths, can create new paths, add new syndicated 
feeds to the system, edit their paths, and publish their private paths to enable others to 
view them. 

The system currently distinguishes between two types of feeds: Web-syndication 
feeds such as news sites, and bibliographic library feeds. In the Path Publisher inter- 
face a user can provide the location of an available feed in one of these two formats 
via a URL. The system attempts to retrieve a feed from that location to ensure that it
exists and is in an understood format. Because generating feeds requires processing 
on the part of their originating servers the user must specify the interval between suc- 
cessive attempts to update a feed. For library materials this value is currently once 
daily (due to the frequency of updates from our library); for Web feeds, values between
hourly and monthly are allowed. All syndicated feeds are explicitly public (and
not owned by any particular user) and the system ensures that different users do not 
request the same feed to avoid duplication. Every hour feeds due to be harvested are 
retrieved and compared with their previous contents. New items are identified and 
appended to the path corresponding to that feed. 
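
The hourly harvesting cycle described above might look roughly like the following. It is a minimal sketch, assuming feed items can be identified by their link; `FeedPath`, `hourly_tick`, and the `fetch` callback are hypothetical names, and the actual system's scheduling and storage are not described in enough detail to reproduce.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Optional


class FeedPath:
    """A path that grows as its source feed publishes new items."""

    def __init__(self, url: str, interval: timedelta) -> None:
        self.url = url
        self.interval = interval               # user-specified update interval
        self.items: List[str] = []             # path elements, oldest first
        self._known = set()                    # links already in the path
        self.last_harvest: Optional[datetime] = None

    def is_due(self, now: datetime) -> bool:
        return self.last_harvest is None or now - self.last_harvest >= self.interval

    def harvest(self, fetched: List[str], now: datetime) -> List[str]:
        """Compare fetched links with previous contents; append only new ones."""
        new = []
        for link in fetched:
            if link not in self._known:
                self._known.add(link)
                new.append(link)
        self.items.extend(new)
        self.last_harvest = now
        return new


def hourly_tick(paths: List[FeedPath],
                fetch: Callable[[str], List[str]],
                now: datetime) -> Dict[str, List[str]]:
    """Run once per hour: harvest only the feeds whose interval has elapsed."""
    return {p.url: p.harvest(fetch(p.url), now) for p in paths if p.is_due(now)}
```

Keeping one `FeedPath` per feed URL, shared by all users, corresponds to the system's rule that feeds are public and never requested twice.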

In a single month fifteen feeds from online news sites provided roughly 15,000 
items. As new Web pages are identified, the system adds them to a queue of pages 
pending textual processing. A distributed set of systems periodically selects pages 
from this list to be downloaded and processed for identifying key phrases and the 
extraction of similarity metrics. We use an approach akin to that described by Phelps 
and Wilensky to extract key phrases to characterize the contents of these pages [21]. 
The pages are tagged to identify parts-of-speech and we use one, two, or three term 
phrases in a manner informed by Turney’s approach, modified to use available hyper- 
text markup cues [29]. Our work in extracting key phrases was initially developed as 
part of our work in locating replacement pages for Web pages that have disappeared, 
and is described in more detail elsewhere (see [6] and [8]). 

In the same month, the campus library feed added 4,656 items to the collection. 
The feed contains several elements in addition to the title, abstract, and link appearing 
in most Web sites’ RSS feeds. Each item in the feed possesses a URL to the library
OPAC’s Web page for that item, which provides up-to-date status information such as
whether that item is currently checked out. The feed also provides any available Li- 
brary of Congress subject headings and the call number assigned to the materials, 
from which we extract Library of Congress classifications [37]. Additional information
provided by the library includes location, a description field providing media-
specific information such as physical dimensions, number of pages (for printed mate- 
rials), and information about copyright and publisher. All of the information provided 
by the library is concatenated into a textual annotation block although a subset of this 
metadata is currently used as criteria for filtering.
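
The text does not say how the classification is derived from the call number. One plausible reading, sketched here as an assumption, is to keep the leading class letters and the integer part of the class number. The `lc_class` function and the sample call numbers are hypothetical.

```python
import re
from typing import Optional

# Assumed shape of an LC call number: one to three capital letters followed by
# the integer part of the class number, e.g. "QA76.9.D3 S45 2004".
_LC_CLASS = re.compile(r"\s*([A-Z]{1,3})\s*(\d{1,4})")


def lc_class(call_number: str) -> Optional[str]:
    """Return the coarse classification (letters + integer), or None."""
    m = _LC_CLASS.match(call_number)
    return m.group(1) + m.group(2) if m else None
```

Under this assumption a call number such as "QA76.9.D3 S45 2004" falls under the "QA76" class used in the filtering examples, while strings that do not start with an LC class yield no classification.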

Figure 1 shows the path selection mode of the Path Server interface to our proto- 
type. In the figure the user is in the process of specifying filters that will define a 
membership function for a dynamically generated path. In the upper left portion of the 
figure a new window has opened within which the user is in the process of specifying 
that matching entities must belong to a particular Library of Congress classification. 
In addition, in the upper portion of the main window, the user had previously specified
that all matching items must have been added after the 15th of March, must not
have been previously seen, and should be associated with the key term “wireless”. 
Key phrases are matched against both items originating from Web-feeds and library 
materials, with Library of Congress subject headings used for the latter. In this in- 
stance, however, because items must also have a specific Library of Congress classi- 
fication code only bibliographic material will be included. Having specified con- 
straints, the user can next either apply them to one of the available paths (two are 
listed in the lower portion of the figure), apply them to all available paths (via the link 
to generate a dynamic path), or, because she had previously logged in, save the con- 
straints to create a new dynamic path. 

Fig. 1. The Path Selection Interface: Specifying Constraints and Creating a Dynamic Path

Figure 2 shows the browsing interface after the user has generated a path using the
constraints shown in the previous figure. The Web browser window has been split
into two panels with the lower one displaying the linked materials, in this case an
overview page from the campus library, and the upper providing navigation and
control options. To the right of the upper panel is an annotation; in this case it lists all of
the metadata provided by the feed with this item. To the right a set of icons provides
the user with the ability to scroll between the available items; moving the mouse over
each numbered icon displays the title of that element. The number of items contained
in this path is listed to the right of the right-scroll icon; in this case eight items
matched. There are also back and next buttons to linearly traverse the included items;
alternatively the user can scroll to a particular item, or enable a “table of contents”
perspective to see all items by title and navigate to specific items. In this figure, the
user has opened the configuration menu (shown in the upper left) by clicking on an
icon seen at the extreme left of the upper panel. This menu provides her with the
ability to save the current page to one of her personal paths, to switch to the “table of
contents” view, to list other paths including this item, and to make changes to the
information filtering criteria currently in use. This interface is discussed in greater detail in
[8] and [25].

5 Discussion 

The additions to Walden's Paths provide two services that are imperative for digital 
libraries that gracefully accommodate the growth of their collections over time. They
help users to identify relevant new materials from amidst the greater set of all avail-
able items, much like the “new titles” shelves in a video store. They also help users to
create and, importantly, share personal collections. As the literature of information 
seeking in libraries and amongst knowledge workers demonstrates, individuals rely 
upon personal and idiosyncratic techniques for keeping track of and returning to 
found information. By allowing users to flexibly organize information, to readily 
locate new relevant materials, and to unify formerly separated classes of information 
streams, our system seeks to empower users in their use of digital collections.

Fig. 2. The Path Browsing Interface: Configuration Menu Displayed

Through the use of information filtering techniques upon Web-based and biblio- 
graphic materials, our system emphasizes the necessity of adapting technology to 
accept the inevitable growth of collections over time. Users of our system can save 
their requirements as a set of constraints which will resolve upon access to contain 
relevant materials from within a collection. Rather than requiring users to perform 
complex search and subsequent filtering operations, these dynamic paths are treated 
as formal, first-class objects in our system. Beyond providing basic awareness possi- 
bilities (“tell me when such and such a book is published”) our system lets a user 
create a growing personal collection of information to be referred to and used as a 
perusable and harvestable resource on its own (“show me every article the New York 
Times has published about ‘Haiti’ and ‘Aristide’ since March 2004 that I have not 
previously read.”) 

Our work in cooperation with the campus library is partially motivated by their
desire for better tools for gauging patrons’ wants and interests and for sharing
information with their patrons. Subject specialist librarians often lack a powerful tool with
which to notify their audience of newly acquired materials, one of their most basic 
work duties. With our system for publishing syndicated content and for sharing paths 
built atop library bibliographic records we provide them with one possible approach 
for doing so. Furthermore, as a technology that enables users to build knowledge 
structures and narratives that contain references to both online resources and proxies 
to physical and library materials, patrons can use our system to create library wish- 
lists and reading lists. Through this process a library may gain a better understanding 
of what resources are most desired by their patrons and make more user-focused
purchasing decisions as a result of this feedback.

While hypertextual paths are a relatively well established concept within academic
circles, the development and evolution of syndicated feeds is a new phenomenon 
worthy of study. Syndicated feeds closely approximate hypertextual paths but raise 
interesting questions as a result of their intrinsic tendency to change and grow over 
time. As with blogs, which are increasingly related to syndicated feed technology, the 
use of these emerging technologies within digital libraries and as mechanisms to as- 
sist in managing electronic collections is a field with much promise for future work. 



6 Future Work 

There are a growing number of libraries using syndication formats to inform patrons 
about new acquisitions. Our current focus is on how the combination of information 
filters and personal selection can support collections in the form of hypertextual paths. 
As we continue our partnership with our campus library, we intend to create a system 
that can be iteratively designed and developed to produce a useful tool for their pa- 
trons, resulting in opportunities for real-world evaluation. Over time we hope to 
evaluate the intersection between physical and purely digital objects and their use in 
our system, and to seek additional sources for hybrid resources. We also intend to 
continue our investigation of techniques for extrapolating relationships between Web- 
based materials on the basis of topicality and semantic similarity, assisted by user 
input and explicit path navigation and authoring behavior.



Abstract. This paper presents a new technique for supporting query formulation
and processing experimentally integrated in the OpenDLib search service. This 
technique provides a better support for unified search by enhancing the capabil- 
ity of the digital library to satisfy the user needs. The paper presents the theory 
underlying the proposed technique and describes how it has been exploited in the 
OpenDLib system. 



1 Introduction 

Digital libraries (DLs) are often built by re-using and integrating information sources, 
originally created by single institutions to serve their own purposes. Each institution de- 
scribes its documents using specific cataloguing rules. Even when a standard metadata 
format is used, the semantic interpretation of the metadata fields and the cataloguing 
terms used are strongly influenced by the assumptions and terminology of the appli- 
cation context in which the institution operates. The content acquired by a DL from 
different heterogeneous sources can be used to serve a multitude of users coming from 
institutions that have not necessarily contributed to provide this content. The different 
cataloguing rules used at the source level are completely transparent to the DL users, 
who formulate queries that express their information needs in terms of the metadata 
format and controlled vocabularies supported by the DL search service. 

This dichotomy between the information source cataloguing environment and the 
search environment complicates both the formulation and the processing of user queries. 
As in the DL framework the users neither know how documents have been originally 
described nor have access to the original description format, they are not always able 
to formulate precisely the conditions required to retrieve documents that satisfy their 
needs. Most DL search services attempt to minimize this problem by automatically 
expanding the user query with the help of stemming and query expansion algorithms. 

In order to process the user queries the system must be able to map the query con- 
ditions against the descriptive metadata of the documents provided by the different in- 
formation sources. The most common solution implemented today to carry out this task 
is to enforce interoperability by requiring every DL information source provider to
expose the descriptions of their documents in at least a shared common metadata for- 
mat. This format is usually also the one accepted by the DL search service language. 
In order to fulfill this requirement, the source provider establishes a mapping between 
its internally used metadata format(s) and the mandatory metadata format and then it 
applies this mapping to all the metadata records of its resources. The DL search service
thus operates in a context where the metadata descriptions and the query language are
homogeneous and can process the query with traditional techniques. 

Current DL systems support both query formulation and processing using tech- 
niques based on syntactic manipulations, without exploiting any semantic information 
about the metadata schemas and controlled vocabularies. One of the reasons for this 
choice is the lack of techniques for exploiting it successfully. 

This paper presents a new technique for supporting query formulation and process- 
ing that uses such semantic information. This technique, which has been experimentally 
integrated in the OpenDLib search service, takes advantage of the specialization rela- 
tionships among the metadata fields and among the terms of the controlled vocabularies 
used. This information is obtained by exploiting the translation relationships that are 
produced by the information source providers when they transform the local description 
formats into the common format. This information, usually discarded, is semantically 
richer than the final common format and can be used for building more powerful search 
services. The OpenDLib search service is thus able to offer the choice among a range 
of possible different interpretations for the same query and the users can select the one 
that better satisfies their needs. Note that this technique does not require any explicit
generation of the metadata records in a pre-defined shared format. 

This paper also presents the model that formally justifies the proposed technique.
This model modifies the work on ontologies presented in [13] in order to apply it to the 
DL framework that is characterized by both metadata schemas and controlled vocabu- 
laries. 

The outline of the paper is as follows: the next section discusses the limitations of the
current search services that exploit only syntactic relations; Sections 4 and 5 justify the
proposed technique formally; Section 6 describes the application of this model in the
OpenDLib framework; finally, Section 7 concludes.



2 Motivation 

In experimenting with DLs built by re-using content from heterogeneous sources, we have
often encountered situations in which the users could not formulate queries that express
their needs and the system was not able to process them properly. These observations
have motivated this work. This section gives examples illustrating some of the problems
we faced. 

Let us consider a simple DL in which the provider of the information source $IS_1$
publishes the following metadata records:

        Subject            Subject.ACM
doc1    text processing    unspecified
doc2    unspecified        I.7.1 Document and Text Editing

According to the internal rules of the DL institution, the authors can describe their
documents by assigning either a code extracted from the ACM Computing Classification
System to the field Subject.ACM or a free term to the more generic field Subject. The
records produced are processed by the system in order to extract the information
required to process the user queries.

Imagine now that the user John Smith wants to retrieve exactly those documents
that have been described with Subject equal to “text processing”. The trivial solution is
to formulate the following query: “Subject = text processing”. The search service has
only to match the query condition against the information extracted from the metadata
records and it usually replies including doc1 and excluding doc2.

Consider now another user of the same DL, Henry Stamp, who is interested in
retrieving all the documents about the topic that his community of interest refers to as
“text processing”. Using a traditional search service, this user cannot do anything better
than formulate the same query as that expressed by John Smith. However, the result
expected in this case is different. It should include: i) the documents retrieved under
the previous, stricter interpretation; ii) the documents whose Subject contains values
morphologically and syntactically close to the query term, e.g. “textual processing” and
“documents and text processing”; and iii) the documents whose more specific subject,
i.e. Subject.ACM, contains values that are semantically close to the query term. Under
this interpretation the system should, therefore, return not only doc1, but also doc2
since its more specific subject field, Subject.ACM, contains I.7.1 “Document and Text
Editing” which is an ACM subcategory of I.7 “Document and Text Processing”.

While the majority of DL search services support an interpretation of the query
based on automatically extracted morphological and syntactic relationships, e.g.
stemming and query expansion, and are thus able to return the documents described in i)
and ii) above, they are not capable of exploiting the semantic relationships that exist
among the different concepts represented by the metadata fields. This means that the
current search services do not usually return documents, like doc2, which are indexed under
metadata fields that are specializations of those indicated in the query, i.e. Subject.ACM.

Although this example may seem trivial, it must be remembered that in order
to satisfy the requirements of the second user, the query must find doc2 which has 
been classified using a narrower subject field but a broader classification term. When 
manipulating complex metadata formats and sophisticated categorization schemas this 
kind of document identification is not a simple task. 
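
The difference between the two users' interpretations can be made concrete with a small sketch. Everything here is illustrative: the dictionaries, the function names, and in particular the rule used for "semantically close" (the query term appears in the value or in one of its broader terms) are assumptions, not the technique formalized later in the paper.

```python
# Records of the example, keyed by document identifier.
RECORDS = {
    "doc1": {"Subject": "text processing"},
    "doc2": {"Subject.ACM": "I.7.1 Document and Text Editing"},
}
# Direct specialization links (narrower -> broader); assumed to be acyclic.
FIELD_PARENT = {"Subject.ACM": "Subject"}
TERM_PARENT = {
    "I.7.1 Document and Text Editing": "I.7 Document and Text Processing",
}


def ancestors(x, parent):
    """x followed by everything it specializes (reflexive and transitive)."""
    chain = [x]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def strict(field, term):
    """John Smith's reading: exact field, exact value."""
    return sorted(d for d, rec in RECORDS.items() if rec.get(field) == term)


def semantic(field, term):
    """Henry Stamp's reading: also search specialized fields and broader terms."""
    hits = []
    for doc, rec in RECORDS.items():
        for f, v in rec.items():
            if field not in ancestors(f, FIELD_PARENT):
                continue  # f is not a specialization of the queried field
            if any(term.lower() in t.lower() for t in ancestors(v, TERM_PARENT)):
                hits.append(doc)
                break
    return sorted(hits)


print(strict("Subject", "text processing"))
print(semantic("Subject", "text processing"))
```

The strict reading finds only doc1, while the semantic reading also reaches doc2 through the narrower field Subject.ACM and the broader category I.7.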

The limitation described above becomes more severe in DLs composed of multiple
information sources, each describing its documents with different metadata formats.
In order to achieve search interoperability over a set of information sources, DLs often
require them to publish their metadata in a shared format, e.g. Dublin Core (DC) [1].
To adhere to the rules of the DL, each information source provider maps its local for- 
mat into the shared format. This mapping is done locally by people that have a clear 
understanding of the semantics associated with the original metadata fields. This infor- 
mation is never transmitted to the DL system that only receives the metadata records 
in the shared format. The query interpretation made by the system is thus defined with- 
out taking into account the local descriptive interpretations. This behavior negatively 
influences the quality of the DL search service. 

To exemplify this point, let us add another information source, $IS_2$, to our example.
It maintains a set of audio-video (A/V) documents of university courses described as in
the following example:

        CourseArea                 CourseTopic        AudioVideoSubject
doc3    Computing Methodologies    Text processing    Document Management

where AudioVideoSubject is the subject of the A/V document, i.e. the subject of a specific
course lecture, CourseTopic is the topic of the course, and CourseArea is the course
research area. Following this semantics, the A/V document, being a course element
which belongs to a specific area, is also implicitly classified under the subject of the
course and the subject of the area.

Suppose now that DC is the common metadata format. The institution that maintains
$IS_1$ maps both Subject and Subject.ACM into dc:subject, whereas the institution that
maintains $IS_2$ maps only AudioVideoSubject to this field. Under this hypothesis, any
query interpretation provided by the search service is unable to return doc3 as a result
of the query presented at the beginning of this section even if the query term exactly
matches the subject of the course of which the video is a part.

The situations exemplified above, and many others, convinced us that the search 
functionality implemented so far by DLs is too strict. Search services that better satisfy
the user needs must be provided. We propose an approach, which can be implemented 
with reasonable costs, able to exploit, as far as possible, the existing semantic mapping 
about the document description terminologies. 

3 The Architectural Framework 

Figure 1 shows the logical DL architectural framework that we assume. The content of the
DL is given by a number of independent heterogeneous Information Sources $IS_1, \ldots, IS_n$
that disseminate metadata records in one or more formats. These records are indexed by
Index services. Moreover, records of different ISs in different formats are indexed by
separate Index services 1. An Index thus processes queries formulated according to the same
terminology, i.e. metadata format and controlled vocabularies, used for the indexed records.
This terminology and the corresponding semantic descriptions are known to the Index,
i.e. it has access to the schemas that specify the metadata format and the controlled
vocabularies associated with the metadata fields. Moreover, we assume that all the Index
services accept the same query structure and relational operators.

An Index service supports different interpretations of the same query condition. 
Each interpretation is characterized by a different level of precision given to the con- 
dition. For example, the different intended semantics given by John Smith and Henry 
Stamp to the query “subject = text processing” are two different interpretations of this
condition. 

The DL user queries are actually not directly evaluated by the Index services but are
first processed by the Query Mediator (QM) service. This service hides the heterogeneity
of the underlying information space, serving search operations in terms of the query
terminology shown to the user 2. It first maps the user’s queries into queries formulated
in the terminology of the underlying information sources, then it dispatches them to the
Index services and, finally, merges the results received. The mapping is done by exploiting
the knowledge of specific semantic relationships between the handled terminology
and the local indexed terminologies. These relationships, defined by the IS providers,
are stored by the corresponding Index services 3. The QM, similarly to the Index, can
support different mapping modalities; the user chooses which one to use.

1 This assumption is only given for simplicity of exposition; it does not compromise the
generality of the solution.

Fig. 1. The architectural framework.

The next two sections introduce the theory that justifies our approach. Follow- 
ing [13], we propose a formalization that applies to the DL framework which has to 
do with metadata schemas and controlled vocabularies. In particular, we specify the 
different query interpretations that can be supported by the Index and QM services and 
how they are obtained by the existing terminology mappings. More details can be found 
in [5]. 
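
The map, dispatch, and merge behaviour of the Query Mediator can be sketched as follows. This is a simplified illustration, not the OpenDLib service: the `Index` and `QueryMediator` classes, the exact-match search, and the two alternative mappings for the second source are assumptions made for the example.

```python
class Index:
    """Indexes the records of one information source in its local format."""

    def __init__(self, records):
        self.records = records

    def search(self, field, term):
        # Simplest possible interpretation: case-insensitive equality.
        return {doc for doc, rec in self.records.items()
                if rec.get(field, "").lower() == term.lower()}


class QueryMediator:
    """Maps a query into each local terminology, dispatches it, merges results."""

    def __init__(self):
        self.sources = []            # list of (index, mapping) pairs

    def register(self, index, mapping):
        # mapping: field shown to the user -> local fields related to it
        self.sources.append((index, mapping))

    def search(self, field, term):
        merged = set()
        for index, mapping in self.sources:
            for local_field in mapping.get(field, []):
                merged |= index.search(local_field, term)
        return sorted(merged)


is1 = Index({"doc1": {"Subject": "text processing"},
             "doc2": {"Subject.ACM": "I.7.1 Document and Text Editing"}})
is2 = Index({"doc3": {"CourseArea": "Computing Methodologies",
                      "CourseTopic": "Text processing",
                      "AudioVideoSubject": "Document Management"}})
is1_map = {"dc:subject": ["Subject", "Subject.ACM"]}

# Mapping that relates only AudioVideoSubject to dc:subject.
narrow = QueryMediator()
narrow.register(is1, is1_map)
narrow.register(is2, {"dc:subject": ["AudioVideoSubject"]})

# Hypothetical richer mapping that also relates CourseTopic to dc:subject.
wide = QueryMediator()
wide.register(is1, is1_map)
wide.register(is2, {"dc:subject": ["AudioVideoSubject", "CourseTopic"]})

print(narrow.search("dc:subject", "text processing"))
print(wide.search("dc:subject", "text processing"))
```

With the narrow mapping the lecture recording doc3 is missed, whereas the richer mapping lets the mediator reach it through the course topic; the choice between such mappings corresponds loosely to the mapping modalities mentioned above.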

4 The Index 

Each information source uses a metadata schema to describe its own documents. This
metadata schema is a pair $(\mathcal{F}, \le_{\mathcal{F}})$, where $\mathcal{F}$ is a set of
schema fields and $\le_{\mathcal{F}}$ is a subsumption relation⁴ over $\mathcal{F}$ that models
the existing specialization relationship among these fields. For example, in Figure 2,
Subject.ACM $\le_{\mathcal{F}}$ Subject means that Subject.ACM is a more specialized
property than Subject. Each field $f$ of the schema is populated via an appropriate
terminology defined as a pair $(V_f, \le_{V_f})$, where $V_f$ is a set of terms and $\le_{V_f}$
is a subsumption relation over $V_f$ that models the existing specialization relationship
among these terms. For example, in Figure 2, Multimedia DL $\le_{V_f}$ DL means that
Multimedia DL is a more specialized term than DL. In certain cases the latter assumption
is too strong: a field is often populated via free terms or free text. In these cases, the
terminology can be obtained easily and automatically by considering each term to be in
relation only with itself or, if stemming is used, by assuming that each term is subsumed
by its stem.
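
The construction for free-text fields can be sketched as follows. This is only an illustration: `toy_stem` is a hypothetical stand-in for a real stemmer, and the sketch assumes that stemming a stem changes nothing, so that the relation stays transitive.

```python
def toy_stem(term: str) -> str:
    """Hypothetical stand-in for a real stemmer: drop one trailing 's'."""
    return term[:-1] if term.endswith("s") and len(term) > 3 else term


def free_term_terminology(terms, stemmer=None):
    """Return (V, leq): the term set and a subsumption relation as (lower, upper) pairs.

    Without a stemmer every term is related only to itself.
    With a stemmer, each term is additionally subsumed by its stem.
    """
    vocab = set(terms)
    leq = {(t, t) for t in vocab}          # reflexivity
    if stemmer is not None:
        for t in list(vocab):
            s = stemmer(t)
            vocab.add(s)
            leq.add((s, s))
            leq.add((t, s))                # term <= stemmed term
    return vocab, leq


vocab, leq = free_term_terminology(["libraries", "library", "index"], toy_stem)
plain_vocab, plain_leq = free_term_terminology(["libraries", "library"])
```

Here `plain_leq` contains only the reflexive pairs, which corresponds to the case where each term is in relation only with itself.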



2 A DL can also offer search operations defined on more than one terminology. This situation 
can be handled by introducing a QM for each of these terminologies. 

3 Protocols, like OAI-PMH, require that any IS provides at least a common DC metadata de- 
scription of its items. In order to adhere to this protocol, each IS provider must first define the 
mapping between its local metadata format and DC, and then generate the DC records. Our 
approach is less demanding, it only requires the mapping and does not need any explicit record 
generation. 

4 Each subsumption relation $\le$ is a reflexive and transitive relation over the reference universe.
We write $o_1 \sim o_2$, meaning that the two objects are equivalent w.r.t. $\le$, if both $o_1 \le o_2$ and
$o_2 \le o_1$.



[Figure 2 not reproduced. Panels: Metadata Schema and Terminology. Legible node labels:
Research Area, Description, Library, Subject.ACM, Audio.Subject, Service System, Multimedia DL.]

Fig. 2. A metadata schema and a terminology.
Combining the metadata schema with the set of terminologies⁵ $V_f$ that the Index
uses, one for each field of the schema, we can define the query terminology that the
Index “speaks” as a pair $(C, \le_C)$, where $C$ is a set of conditions $(f, v)$ such that
$f \in \mathcal{F}$ and $v \in V_f$. This models the boolean condition “field $f$ equals term $v$”.
For example, a valid condition for the Index in Figure 2 is (Subject, Digital Library),
representing the information need “the documents whose Subject is Digital Library”.

The subsumption relation $\le_C$ models the specializations among conditions and is
formally defined as follows:

Definition 1. [Subsumption relation] Let $(\mathcal{F}, \le_{\mathcal{F}})$ be a metadata schema and
$(V_f, \le_{V_f})$ be the terminology for the schema field $f$. Given $c_1, c_2 \in C$, where
$c_i = (f_i, v_i)$, $f_i \in \mathcal{F}$ and $v_i \in V_{f_i}$, we define
$c_1 \le_C c_2 \iff f_1 \le_{\mathcal{F}} f_2 \wedge v_1 = v_2$.

Figure 2, for example, says that (Audio.Subject, Library) $\le_C$ (Research Area, Library)
and that (Subject.ACM, DLSS) $\le_C$ (Subject, DLSS), meaning that the first condition is a
specialization of the second one.
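
Definition 1 can be sketched in a few lines. The field hierarchy below is an assumption reconstructed from the examples in the text, since Figure 2 is not reproduced here, and the function names are illustrative only.

```python
def closure(pairs, universe):
    """Reflexive-transitive closure of a relation given as (lower, upper) pairs."""
    leq = set(pairs) | {(x, x) for x in universe}
    changed = True
    while changed:
        changed = False
        for (a, b) in list(leq):
            for (c, d) in list(leq):
                if b == c and (a, d) not in leq:
                    leq.add((a, d))
                    changed = True
    return leq


# Assumed field specializations (lower, upper), inferred from the text's examples.
FIELDS = {"Subject", "Subject.ACM", "Audio.Subject", "Research Area", "Description"}
FIELD_LEQ = closure(
    {
        ("Subject.ACM", "Subject"),
        ("Audio.Subject", "Subject"),
        ("Subject", "Research Area"),
        ("Subject", "Description"),
    },
    FIELDS,
)


def cond_leq(c1, c2):
    """c1 <=_C c2 iff the field of c1 is subsumed by the field of c2 and the values are equal."""
    (f1, v1), (f2, v2) = c1, c2
    return (f1, f2) in FIELD_LEQ and v1 == v2
```

Note that, as in the definition, the value terminology plays no role here: two conditions with different values are never comparable under $\le_C$.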

A query for the Index is either a simple condition or a combination of conditions
using the boolean connectives $\wedge$, $\vee$, $\neg$. For example, a simple query can be
(Subject, Digital Library) $\vee$ (Description, Library).

Definition 2. [Interpretation] An interpretation $I$ of a query terminology $C$ is a
function $I : C \to 2^{Obj}$ that associates each condition of $C$ with a set of objects of the
domain.

Each Index has an interpretation $I$ that is the result of the indexing phase. Column $I$
of Table 1 presents an interpretation of the Index presented in Figure 2⁶.

The interpretation that an Index uses for query evaluation must comply with the
structure of the query terminology (i.e. $\le_C$). This requirement is expressed by
introducing the notion of model.

Definition 3. [Model] An interpretation $I$ is a model of a query terminology $(C, \le_C)$ if
$\forall c, c' \in C,\; c \le_C c' \Rightarrow I(c) \subseteq I(c')$.

For example, suppose that an Index has indexed a set of documents under the condition
$c_1$ and another set of documents under the condition $c_2$, and no documents under the
condition $c$ that subsumes the previous two conditions. This interpretation is acceptable
as we can “respect” the structure of $\le_C$ by defining the interpretation of $c$ as the union
of the set of documents indexed under $c_1$ and those indexed under $c_2$.
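
The example can be checked mechanically. The sketch below assumes that the subsumption relation is given as a transitively closed set of (lower, upper) pairs; the names `is_model` and `smallest_model` are illustrative, not taken from the paper.

```python
def is_model(interp, leq_pairs):
    """True if c <= d implies I(c) is a subset of I(d) for every pair (c, d)."""
    return all(interp.get(c, set()) <= interp.get(d, set()) for c, d in leq_pairs)


def smallest_model(interp, leq_pairs):
    """Extend each I(d) with the documents of every condition that d subsumes."""
    conditions = set(interp) | {x for pair in leq_pairs for x in pair}
    return {
        d: set().union(
            *(interp.get(c, set()) for c in conditions if c == d or (c, d) in leq_pairs)
        )
        for d in conditions
    }


# c subsumes c1 and c2, but nothing has been indexed under c itself.
LEQ = {("c1", "c"), ("c2", "c")}
stored = {"c1": {1, 2}, "c2": {3}, "c": set()}
repaired = smallest_model(stored, LEQ)
```

The stored interpretation is not a model, while `repaired` is: the interpretation of `c` becomes the union of the documents indexed under `c1` and `c2`.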

As there may be several models of C, we assume that each Index is able to process 
queries from one or more models of its interpretation. In this paper, as suggested in [13], 
we will consider two families of models for query processing, the sure evaluation mod- 
els and the possible evaluation models. In order to define these models formally we 
need two preliminary definitions that allow us to follow the subsumption relation, re- 
spectively, over the fields of the metadata schema and over the controlled vocabularies. 

Definition 4. [Tail and Head] Given a condition $c \in C$, $c = (f, v)$, we define
$tail(c) = \{c' \in C \mid c' \le_C c\}$        $head(c) = \{c' \in C \mid c \le_C c'\}$

5 We will use $V_f$ instead of $(V_f, \le_{V_f})$ when no confusion arises.

6 For simplicity, we will use the same terminology to populate all the schema fields.
Intuitively, $tail(c)$ and $head(c)$ contain $c$ and, respectively, all the conditions that are
stricter than $c$ and wider than $c$ according to the query terminology and, in particular,
to the subsumption relations over the schema fields. For example, considering Figure 2,
tail(Subject, DL) = {(Subject, DL), (Subject.ACM, DL), (Audio.Subject, DL)} while
head(Subject, DL) = {(Subject, DL), (Research Area, DL), (Description, DL)}.
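
A minimal sketch of tail and head, assuming the field hierarchy implied by this example (Figure 2 itself is not reproduced here) and writing the field relation out in transitively closed form:

```python
# Assumed field specializations (lower, upper), already transitively closed.
FIELD_LEQ = {
    ("Subject.ACM", "Subject"), ("Audio.Subject", "Subject"),
    ("Subject", "Research Area"), ("Subject", "Description"),
    ("Subject.ACM", "Research Area"), ("Subject.ACM", "Description"),
    ("Audio.Subject", "Research Area"), ("Audio.Subject", "Description"),
}
FIELDS = {f for pair in FIELD_LEQ for f in pair}


def field_leq(f1, f2):
    """Reflexive field subsumption."""
    return f1 == f2 or (f1, f2) in FIELD_LEQ


def tail(c):
    """Conditions that are stricter than or equal to c (same value, narrower field)."""
    f, v = c
    return {(g, v) for g in FIELDS if field_leq(g, f)}


def head(c):
    """Conditions that are wider than or equal to c (same value, broader field)."""
    f, v = c
    return {(g, v) for g in FIELDS if field_leq(f, g)}
```

Because Definition 1 requires equal values, both sets vary only the field component of the condition.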

Definition 5. [Value models] Given an interpretation $I$ of $C$ and a condition $c \in C$,
$c = (f, v)$, we define three kinds of value models for $c$ generated by $I$ as follows, where
$c' = (f', v')$:

$I_\sim(c) = \bigcup \{I(c') \mid f = f' \wedge v' \sim_{V_f} v\}$
$I_\le(c) = \bigcup \{I(c') \mid f = f' \wedge v' \le_{V_f} v\}$
$I_\ge(c) = \bigcap \{I_\le(c') \mid f = f' \wedge v \le_{V_f} v' \wedge v \not\sim_{V_f} v'\}$

The above interpretations correspond to three different ways in which the Index can
evaluate a condition that involves the field $f$ using the stored interpretations and the
semantic information on the controlled vocabularies. These interpretations correspond
to the set of documents indexed under conditions involving the field $f$ and, respectively,
the value $v$ or values equivalent to $v$ ($I_\sim$), the value $v$ or values subsumed by $v$ ($I_\le$),
and all the values that subsume $v$ ($I_\ge$).

We can now define the sure evaluation model and the possible evaluation model of 
the stored interpretation I. These are obtained by taking into account both the subsump- 
tion relations among the schema fields and the subsumption relations among terminolo- 
gies. 

Definition 6. [Sure and Possible models] Given an interpretation $I$ of $C$ we define
three types of sure evaluation models ($I^-_*$) and three types of possible evaluation models
($I^+_*$) of $C$ (where “$*$” stands for $\sim \mid \le \mid \ge$) generated by $I$ as follows:

$I^-_*(c) = \bigcup \{I_*(c') \mid c' \in tail(c)\}$        $I^+_*(c) = \bigcap \{I^-_*(c') \mid c' \in head(c) \wedge c' \not\sim_C c\}$

Table 1⁷ shows the sure evaluation models of our Index that use the terminology in
Figure 2, based on the stored interpretation $I$.



Table 1. Interpretations of an information source index.

Condition                       $I$       $I^-_\sim$    $I^-_\le$              $I^-_\ge$
(Subject, Digital Library)      {1}       {1,2}         {1,2,3}                {1,2,3,4}
(Subject, DL)                   {2}       {1,2}         {1,2,3}                {1,2,3,4}
(Subject, Info. Sys.)           {5}       {4,5}         {1,2,3,4,5}            {1,2,3,4,5,6}
(Subject, Library)              {6}       {4,6}         {1,2,3,4,6}            {1,2,3,4,5,6}
(Subject.ACM, DLSS)             {3}       {3}           {3}                    {3}
(Audio.Subject, Info. Sys.)     {4}       {4}           {4}                    {4}
(Audio.Subject, Library)        {4}       {4}           {4}                    {4}
(Research Area, DL)             {7}       {1,2,7,10}    {1,2,3,7,8,10}         {1,2,3,7,8,9,10}
(Research Area, DLSS)           {8}       {3,8}         {3,8}                  {3,8}
(Research Area, Info. Sys.)     {9}       {4,5,9}       {1,2,3,4,5,7,8,9,10}   {1,2,3,4,5,6,7,8,9,10}
(Research Area, Library)        {9}       {4,6,9}       {1,2,3,4,6,7,8,9,10}   {1,2,3,4,5,6,7,8,9,10}
(Research Area, Dig. Lib.)      {10}      {1,2,7,10}    {1,2,3,7,8,10}         {1,2,3,4,7,8,9,10}
(Description, Multimedia DL)    {8}       {8}           {8}                    {1,2,3,7,8,11}
(Description, DL)               {7,11}    {1,2,7,11}    {1,2,3,7,8,11}         {1,2,3,4,7,8,9,11}
(Description, Info. Sys.)       {9}       {4,5,9}       {1,2,3,4,5,7,8,9,11}   {1,2,3,4,5,6,7,8,9,11}
(Description, Library)          {9}       {4,6,9}       {1,2,3,4,6,7,8,9,11}   {1,2,3,4,5,6,7,8,9,11}



7 In this table, $i$ stands for the document $d_i$.
Even if the indexing phase is correct, certain documents may not have been indexed
under all the conditions that could apply to them. So, given a simple query $c$, we may
want the source to answer with either all the documents that are known
to be indexed under $c$ or all the documents that are possibly indexed under $c$. In the first
case we want the sure evaluation model, while in the latter case we ask for the possible
evaluation model.

Definition 7. [Sure and Possible Query answering] Let $q$ be a query over $C$ and let
$I$ be an interpretation of $C$. The sure answer $I^-_\le(q)$ and the possible answer $I^+_\le(q)$ are
defined as follows:

$I^-_\le(c) = \bigcup \{I_\le(c') \mid c' \in tail(c)\}$        $I^+_\le(c) = \bigcap \{I^-_\le(c') \mid c' \in head(c) \wedge c' \not\sim_C c\}$
$I^-_\le(q \wedge q') = I^-_\le(q) \cap I^-_\le(q')$        $I^+_\le(q \wedge q') = I^+_\le(q) \cap I^+_\le(q')$
$I^-_\le(q \vee q') = I^-_\le(q) \cup I^-_\le(q')$        $I^+_\le(q \vee q') = I^+_\le(q) \cup I^+_\le(q')$

qhq) = q(q) qq q ) = it(q) 

All the other sure and possible answers for the other models, i.e. $I^-_\sim$, $I^-_\ge$, $I^+_\sim$ and $I^+_\ge$,
are defined in a similar way.
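
The conjunction and disjunction cases above amount to a recursive evaluation over set intersection and union. The sketch below assumes a hypothetical query encoding as nested tuples, `("and", q, q')` or `("or", q, q')`, with a plain `(field, value)` pair as a simple condition; negation is left out. The sample values are taken from the $I^-_\le$ column of Table 1.

```python
# A few sure answers for simple conditions, copied from the I^-_<= column of Table 1.
SURE_LEQ = {
    ("Subject", "DL"): {1, 2, 3},
    ("Research Area", "DLSS"): {3, 8},
    ("Subject", "Info. Sys."): {1, 2, 3, 4, 5},
}


def answer(query, model):
    """Evaluate a query against a model mapping simple conditions to document sets.

    Assumes no schema field is literally named "and" or "or".
    """
    if query[0] in ("and", "or"):
        op, left, right = query
        a, b = answer(left, model), answer(right, model)
        return a & b if op == "and" else a | b
    return model.get(query, set())   # simple condition (field, value)


both = answer(("and", ("Subject", "DL"), ("Research Area", "DLSS")), SURE_LEQ)
either = answer(("or", ("Subject", "DL"), ("Research Area", "DLSS")), SURE_LEQ)
```

Passing a dictionary of possible answers instead of `SURE_LEQ` gives the possible answer to the same query, since both families combine in the same way.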

Each of the above query answering modes represents a modality of query processing.
Note that the sure answer is appropriate for users who focus on precision, while the
possible answer is for users who focus on recall. Moreover, in both the sure and the
possible family of answers, we can distinguish more precision-oriented responses, i.e. $I_\sim$,
from more recall-oriented responses, i.e. $I_\ge$. An Index that stores an interpretation, like the
one given in Table 1, and that has access to the semantics of the metadata schema and
its controlled vocabularies, can thus potentially offer a range of other interpretations,
like the ones given in the same table, to any of its clients so that they can express their
information needs more precisely.

For example, a user expressing the query (Subject, DL) could be interested in documents
that have been described using the field Subject, or a more specialized one,
and the term DL or an equivalent term, so this user is asking for $I^-_\sim$. Another user
expressing the same query could be interested, instead, in those documents that have been
described using the field Subject, or a more generic field, and the term DL or an equivalent
term, so this user is asking for $I^+_\sim$. In the case of Table 1, the Index returns the
set of documents $\{d_1, d_2\}$ to the first user and the set of documents $\{d_1, d_2, d_7\}$ to the
second user. Note that while $d_1$ and $d_2$ are indexed under the conditions (Subject, Digital
Library) and (Subject, DL) respectively, the document $d_7$ is indexed under a pair of
conditions, (Research Area, DL) and (Description, DL), that are more general but still
pertinent to the one expressed by the user.
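
The two answers in this example can be reproduced with a small sketch of the $\sim$ models. The field hierarchy and the equivalence between DL and Digital Library are assumptions inferred from the text, and the stored interpretation lists only the Table 1 entries that matter for this query.

```python
# Assumed field specializations (lower, upper), already transitively closed.
FIELD_LEQ = {
    ("Subject.ACM", "Subject"), ("Audio.Subject", "Subject"),
    ("Subject", "Research Area"), ("Subject", "Description"),
    ("Subject.ACM", "Research Area"), ("Subject.ACM", "Description"),
    ("Audio.Subject", "Research Area"), ("Audio.Subject", "Description"),
}
FIELDS = {f for pair in FIELD_LEQ for f in pair}
# Assumed term equivalence: DL ~ Digital Library.
EQUIVALENT = {"DL": {"DL", "Digital Library"}, "Digital Library": {"DL", "Digital Library"}}
# Stored interpretation I, restricted to the conditions used below.
STORED = {
    ("Subject", "Digital Library"): {1},
    ("Subject", "DL"): {2},
    ("Research Area", "DL"): {7},
    ("Research Area", "Digital Library"): {10},
    ("Description", "DL"): {7, 11},
}


def field_leq(f1, f2):
    return f1 == f2 or (f1, f2) in FIELD_LEQ


def tail(c):
    return {(g, c[1]) for g in FIELDS if field_leq(g, c[0])}


def head(c):
    return {(g, c[1]) for g in FIELDS if field_leq(c[0], g)}


def value_sim(c):
    """I_~(c): documents stored under the same field with an equivalent value."""
    f, v = c
    return set().union(*(STORED.get((f, w), set()) for w in EQUIVALENT.get(v, {v})))


def sure_sim(c):
    """I^-_~(c): union of I_~ over tail(c)."""
    return set().union(*(value_sim(d) for d in tail(c)))


def possible_sim(c):
    """I^+_~(c): intersection of I^-_~ over the strictly wider conditions in head(c)."""
    wider = [sure_sim(d) for d in head(c) if d != c]
    if not wider:
        # Assumption: with no wider condition, every known document is possible.
        return set().union(*STORED.values())
    return set.intersection(*wider)
```

With these assumptions, `sure_sim(("Subject", "DL"))` yields documents 1 and 2, and `possible_sim(("Subject", "DL"))` yields documents 1, 2 and 7, matching the sets returned to the two users.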

5 The Query Mediator 

The previous section described the potential query evaluation choices
of an Index service that exploits semantic information. We can now examine the more
general problem of understanding which query evaluation choices can be supported by
a Query Mediator service. In what follows we will assume that such a mediator
dispatches queries to Index services that behave as described in the previous section.
Abstractly, a QM service can be considered as an Index service that virtually stores
all the objects of the underlying sources and supplies a query language that satisfies
the needs of its user community. However, there is an important difference between a
QM and an Index: the QM does not explicitly store any interpretation of the information
space. Such interpretations are maintained by the Index services. The QM only stores an
articulation for each source, i.e. a set of relationships among the Mediator terminology
and the Index terminology. A QM is formally defined as follows:

Definition 8. [Query Mediator] A QM over n Index services such that 

h = (Cj, <cj, consists of: 

1. a query terminology $(C_M, \le_{C_M})$ and

2. a set of articulations $a_i$, one for each Index $I_i$; each articulation $a_i$ is a subsumption
relation over $C_M \cup C_i$ which contains:

- a subsumption relation over $\mathcal{F}^M \cup \mathcal{F}^i$, i.e. a set of relationships among
the Mediator metadata schema and the Index metadata schema,

- a set of subsumption relations over $V_f^M \cup V_{f'}^i$, i.e. a set of relations
among each field terminology of the Mediator and the corresponding ones in
the Index. There exists one such relation for each pair of (Mediator field
terminology, Index field terminology).



We introduce a special subsumption relation between Mediator and Index field
terminologies, $\Pi_f$, to indicate that every term of the first terminology is mapped into the
same term of the second terminology. In such a case we impose that $V_f^M = V_f^i$, i.e. the
terminology of the Index is the same as that of the Mediator, and $\Pi_f$ is defined such
that for each $v \in V_f^M$ and $v' \in V_f^i$, $v \sim v'$ if and only if $v = v'$, i.e. the term on the
Mediator is equivalent to the same term of the Index w.r.t. the articulation.

The Mediator query terminology is defined similarly to the Index one, i.e. $C_M$ is a set
of pairs $(f, v)$ such that $f \in \mathcal{F}^M$, $v \in V_f^M$, and $\le_{C_M}$ is a subsumption relation
over $C_M$. Moreover, each $V_f^M$ is a terminology, i.e. a pair $(V_f^M, \le_{V_f^M})$ where
$\le_{V_f^M}$ is a subsumption relation over $V_f^M$.

Figure 3 shows an example of a QM that operates over two Indexes. This mediator uses
the DC metadata schema and the ACM Computing Classification System as controlled
vocabulary for the field subject⁸. The Index services in Figure 3 are Index_1, which was
introduced in the previous section, and Index_2, an Index service that uses the LOM
metadata schema [2] and free terms to populate the fields shown in the figure.

Fig. 3. A Query Mediator over two Indexes.

8 For brevity, the example shows only a partial view of the Query Mediator.

The query interpretations supported by the QM are defined in terms of both the
interpretations stored by the Index services and the existing articulations. In order to
identify these interpretations we show how the mediator proceeds in order to reply to a query:

1. define a query $c^i$ for $I_i$ as a translation of each $c \in C_M$ obtained using $a_i$, $i = 1, \ldots, n$;

2. evaluate $c^i$ at $I_i$, $i = 1, \ldots, n$; and finally

3. define $I(c)$ as the union of the answers to $c^i$ returned by the Index services.
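
The three steps can be sketched as a translate, dispatch and merge loop. Everything below is illustrative: the articulations are modelled as plain dictionaries from a mediator condition to a list of local conditions read as a disjunction, each Index simply looks up its stored interpretation, and the document identifiers of the second Index are hypothetical.

```python
# Each Index maps local conditions to document sets (identifiers are illustrative).
INDEXES = {
    "Index1": {("Subject", "Digital Library"): {"d1"}, ("Subject", "DL"): {"d2"}},
    "Index2": {("1.5", "H.3.7"): {"x1"}, ("9.1", "H.3.7"): {"x2"}},
}
# One articulation per Index: mediator condition -> disjunction of local conditions.
ARTICULATIONS = {
    "Index1": {
        ("DC.subject", "H.3.7"): [("Subject", "Digital Library"), ("Subject", "DL")],
    },
    "Index2": {
        ("DC.subject", "H.3.7"): [
            ("1.5", "H.3.7"), ("9", "H.3.7"), ("9.1", "H.3.7"), ("9.2", "H.3.7"),
        ],
    },
}


def mediate(condition):
    """Answer a mediator condition by translating it for every Index and merging results."""
    merged = set()
    for name, index in INDEXES.items():
        translated = ARTICULATIONS[name].get(condition, [])   # step 1: build c^i
        for local in translated:                              # step 2: evaluate c^i at I_i
            merged |= index.get(local, set())
    return merged                                             # step 3: union of the answers


result = mediate(("DC.subject", "H.3.7"))
```

In a fuller version, step 2 would ask each Index for a sure or possible answer rather than reading the stored interpretation directly.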

Several possible translations, considering the semantic relationships among QM and
Index terminologies, can be identified. We define (for details see [5]) precise, lower
and upper approximations of a condition $c \in C_M$. Roughly speaking, the first one,
$c^i_\sim$, is the disjunction of all the conditions in $C_i$ that are equivalent to $c$ in $a_i$; the
second one, $c^i_l$, is the disjunction of all the conditions in $C_i$ that $c$ subsumes in $a_i$;
while the last one, $c^i_u$, is the conjunction of all the conditions that subsume $c$ in $a_i$.

Examples of approximations for the QM shown in Figure 3⁹ are:

$(DC.subject, H.3.7)^1_\sim$ = (Subject, Digital Library) $\vee$ (Subject, DL)

$(DC.subject, H.3.7)^2_\sim$ = (1.5, H.3.7) $\vee$ (9, H.3.7) $\vee$ (9.1, H.3.7) $\vee$ (9.2, H.3.7)

The approximations are just queries to the information source $S_i$ and can have sure
or possible answers as shown in Section 4. For this reason we can define at least 54
possible interpretations $I$ for the QM¹⁰, denoted by $I_{a,b}$, where $a$ is the type of QM
approximation and $b$ is the answer type from the source, e.g. $I_{u\le,+\le}$ means that the
mediator uses the upper approximation with $\le$, while the sources reply following the
possible model with $\le$. These interpretations are defined as the set union over the source
interpretations w.r.t. the mediator approximation, e.g.
$I_{u\le,+\le}(c) = \bigcup_{i=1}^{n} I^+_{i,\le}(c^i_{u\le})$.

As the QM is an IS, it can give either one of the three sure answers or one of the three
possible answers for each of the above interpretations, i.e. we can have 324 possible
modes under which the mediator can operate. These operation modes are denoted by
$I^c_{a,b}$, where $a$ is the type of mediator approximation, $b$ is the source answer type and $c$
is the mediator answer type, e.g. $I^{+\ge}_{u\le,+\le}$ means that the mediator uses the upper
approximation with $\le$ and replies following the possible model with $\ge$, while the sources
reply following the possible model with $\le$.
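
The counts 54 and 324 can be reproduced under one plausible reading, which is an assumption since the text does not spell out the factors: nine approximation types (precise, lower or upper, each combined with one of the three relations) and six answer types (sure or possible, each with one of the three relations).

```python
from itertools import product

RELATIONS = ["~", "<=", ">="]

# Assumed factorisation: 3 approximation kinds x 3 relations = 9 approximation types.
APPROXIMATIONS = list(product(["precise", "lower", "upper"], RELATIONS))
# 2 answer families x 3 relations = 6 answer types.
ANSWERS = list(product(["sure", "possible"], RELATIONS))

# An interpretation pairs a mediator approximation with a source answer type.
interpretations = list(product(APPROXIMATIONS, ANSWERS))
# An operation mode additionally fixes the mediator's own answer type.
modes = list(product(interpretations, ANSWERS))

print(len(interpretations), len(modes))
```

Under this reading the mode written $I^{+\ge}_{u\le,+\le}$ corresponds to the tuple `((("upper", "<="), ("possible", "<=")), ("possible", ">="))`.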



6 The Enhanced OpenDLib Search Service 

The approach described theoretically above has been exploited for building a more 
advanced search service for the OpenDLib Service System [7]. 

The OpenDLib architecture is very similar to the logical one described in Section 3.
Its search service does not support any subsumption between attributes of the metadata
format and assumes the standard subsumption relation between terms and their stems.
Moreover, the search functionality over heterogeneous metadata formats is supported
thanks to a common metadata format.

9 We used the code of the fields/terminology terms instead of the value when no confusion arises.

10 For simplicity we assume that all the Indexes respond using the same type of answer.
One of the on-line DLs powered by the OpenDLib software is called tLibrary. It
manages documents harvested from different ISs. Some of these sources represent their
content using DC, others use the qualified version of this format, and others apply
proprietary metadata descriptions. The different semantic interpretations of the same metadata
fields and the presence of a variety of field qualifiers reduce the quality of the search
functionality when heterogeneous information sources are selected by the user, even
when all the different metadata descriptions of the content are indexed.

To overcome this problem we decided to design an experimental search service
fully based on the illustrated techniques. We needed to i) easily guide users in querying
both homogeneous and heterogeneous information sources; ii) present in a simple way how to
ask for a more precision-oriented, or recall-oriented, query evaluation; and iii) hide the
complexity of the proposed approach.

Taking into account that our harvested information sources have not used controlled
vocabularies, and that it was therefore not possible to identify subsumption relations between
values, we decided to maintain the support of the standard subsumption relation between
terms and their stems. Moreover, we decided to support only the $\Pi_f$ approximation, i.e.
we chose to simplify the users' interaction with the system at the cost of not exploiting
the relations among different controlled vocabularies, e.g. the Dewey Decimal
Classification (DDC), the Library of Congress Classification, etc.

The resulting search service is based on two relation operators¹¹, literal and contain,
and two search functionalities, simple and cross-schema.

The simple search functionality supports query requests on homogeneous information
sources. It lets users choose between two possible query interpretation models, sure
and possible. This means that, for each query, users can now specify the personalized
recall that they think is needed to satisfy their needs. For example, the user John Smith,
who is confident that he is interested only in documents that are classified exactly with
the token “text processing”, can specify the query as “subject literal text processing”. A
second user, Henry Stamp, who searches for documents about the same token but does
not know how they have been classified, can ask for an interpretation of the query that
also takes into account documents that are classified under the semantically specialized
“subject” fields, i.e. he can select the sure interpretation, which will also return doc2.
Finally, a third user, who wants to retrieve documents about “digital libraries” and is
clearly focused on recall, can specify the query as “subject contain digital libraries”
and select the possible interpretation, implicitly asking for $I^+_\ge$ query answering. In
order to implement this functionality the Index service has been enhanced to support
the sure, $I^-_\sim(c)$ and $I^-_\ge(c)$, and possible, $I^+_\sim(c)$ and $I^+_\ge(c)$, evaluation models
described in Section 4. Preliminary tests demonstrate that field qualifiers are best
managed using the sure evaluation model if the query has been expressed
on a field that supports qualifiers, and the possible evaluation model if the query has
been expressed on a qualifier of a field.

The cross-schema search functionality supports query requests on heterogeneous
information sources. It lets users choose between three possible query interpretation
models, precise, lower, and upper, that indicate the type of approximation the system applies
to navigate heterogeneous metadata schemas. This means that, for each search request,
users can now specify how the system, using the relations among the different metadata
schemas, must reformulate the user query. Clearly, the lower approximation is more
precision-oriented while the upper one is more recall-oriented. In order to implement this
functionality we enhanced the QM service to ingest the mapping schema, which contains the
definition of the non-trivial articulations between metadata schemas, and to support the
precise, lower, and upper approximations as defined in Section 5. In particular, we verified
the benefits in query processing where the QM applies lower approximations and asks the
Indexes to use the sure evaluation models, and where the QM applies upper approximations
and the Indexes use possible evaluation models.

11 Using relation operators, the user specifies how the system must interpret the query tokens.

These restrictions on the set of possible combinations mean that a user of tLibrary
can only ask for six possible interpretations of the query on heterogeneous information
sources and four possible interpretations of the query on information sources that use
the same metadata schema. Nevertheless, from the user’s point of view, the appropriate
use of these personalized search evaluations makes it possible to improve recall without
losing search precision.

We are now working to identify other combinations of approximations and
query evaluation models that could help users to satisfy their needs without making
the interaction between users and system much more complex. We also plan
to support the articulation between terminologies in order to offer a second-generation
search service over metadata schemas and ontologies.

7 Conclusion 

We have presented a new approach to query formulation and processing that has been
tested in OpenDLib. This approach exploits subsumption links among metadata fields of
different metadata formats, and among the terms of controlled vocabularies.

Much work, especially in the area of information retrieval, has been done in order 
to better satisfy the search requirements of the user. Our technique is not intended as 
an alternative to the current well consolidated search processing techniques, but as a 
complementary one. Its implementation can be embedded in a conventional framework. 

One objective was to come up with a low-cost solution. Our solution requires
information source providers to specify only the mapping between their local document
description metadata fields and the metadata fields of the QM service. Unlike other
approaches, it does not require the generation of descriptive records in a shared format.
Moreover, we expect that in the near future its cost can be further decreased with the
advent of new techniques for (semi-)automatic ontology mapping.

The complexity of our Index and QM services partly depends on the number of 
query processing options that are supported. Certainly, some of them are intuitively 
useful, while others have only a theoretical value. We have exploited only a few of them.

A fuller exploitation of semantic information in query processing is useful not only
to enhance the quality of the search service, but also to improve the quality of any
other service that queries the DL. For example, it can be useful for a service that
provides a virtual view of the DL collections or for a recommender service. One of our
next steps will certainly be to study the impact that the proposed approach may have on
the quality of these other DL services. We are firmly convinced that the exploitation of
semantic information can have a very positive effect on these “user-centered” services.
Abstract. In previous work, we proposed a focus-based multi-level clustering
technique. It consists in computing a particular clustered graph from a given 
graph and a focus. The resulting clustered graph is called multi-level outline 
tree. It is a tree whose meta-nodes are sub-sets of nodes. A meta-node is itself 
hierarchically clustered depending on its connectivity. In this paper we intro- 
duce a cluster cohesiveness measure to enhance the results of the previously 
proposed algorithm. We further propose an optimization of this algorithm to 
support fluid interaction when focus changes. Finally, we report the results of a 
case study that consists in applying the enhanced algorithm to citation graphs 
where documents are considered as vertices and citation links as edges. 



1 Introduction 

In a digital library, scientific papers are linked together by citation relations. The 
resulting graph, called citation graph, is easy to explore but usually not so easy to 
organize. Indeed, when users can simply browse papers using citation links, they 
usually feel lost after a long navigation. Keeping a synthetic view of navigation is 
challenging. Users need efficient tools to explore and organize papers [7]. 

Various approaches have been proposed to organize documents. Supervised classification
methods use training examples to sort documents according to text similarities,
whereas unsupervised classification techniques (called clustering) try to
discover natural clusters of documents without any prior knowledge. The aim is to
provide an automatic organization of documents into cohesive groups (clusters)
according to some measure of similarity.

In this paper, after briefly reviewing related work, we recall the principles underlying
our multi-level clustering method [3, 4]. Then we propose a new similarity measure
that extends co-citation and bibliographic coupling. Indeed, we consider not only
direct citations (in links or out links) but also “paths” of citations between nodes in a
k-neighborhood. K-articulation nodes are defined as nodes at distance k that disconnect
these paths. K-articulation nodes are used to introduce cluster cohesiveness. We
further propose an optimization of our algorithm to support fluid interaction when the
focus changes. Finally, we apply our technique to a digital library of computer
science literature: ResearchIndex [12]. We think that multi-level outline trees are well
suited to organizing citation graphs according to user focus.
2 Related Work

Many clustering techniques can be used to cluster documents. We briefly review
existing work by presenting techniques with their main characteristics. Since most
approaches are based on similarity measures, we begin by addressing these measures.



Text-Based or Link-Based Similarity Measures 

The most popular content-based similarity measure is cosine similarity [9]. It takes into
account the angle between two documents represented as vectors in a term space (see
Vector Space Model [15, 17]). Unfortunately, computing this measure for each possible
pair of documents in a given set of documents can be time-consuming, so it is usually
reserved for small sets of documents. Considering that documents belong to a
citation graph, we can also use similarity measures based on co-citation or bibliographic
coupling [17]. Co-citation between documents p and p’ is a similarity measure
defined as the number of papers that co-cite p and p’. Bibliographic coupling is the
number of papers that are cited by both p and p’. These measures are easy to compute
with the adjacency matrix [5]. Some hybrid similarity measures exploit both content and
link similarity [16]. If the citation graph is viewed as an undirected graph, we can define
the distance between documents as the length of the shortest path between them. Then,
simple, average and complete linkages define distances between clusters.
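
As a minimal sketch of the adjacency-matrix computation, assume a matrix A with A[i][j] = 1 when paper i cites paper j. Co-citation counts are then the entries of AᵀA and bibliographic coupling counts are the entries of AAᵀ. The four-paper graph below is a made-up example.

```python
def co_citation(adj, p, q):
    """Number of papers that cite both p and q: entry (p, q) of A^T A."""
    return sum(row[p] * row[q] for row in adj)


def bibliographic_coupling(adj, p, q):
    """Number of papers cited by both p and q: entry (p, q) of A A^T."""
    return sum(a * b for a, b in zip(adj[p], adj[q]))


# Hypothetical citation graph: papers 0 and 1 both cite 2 and 3; paper 2 cites 3.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
```

For example, papers 2 and 3 are co-cited by papers 0 and 1, while papers 0 and 2 are coupled only through their shared citation of paper 3.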



Clustering Techniques 

A partitioning method like K-means iteratively provides k clusters in linear time.
However, the clustering is highly dependent on the value of k and on the initial positions [15].
A hierarchical clustering technique usually proceeds iteratively by merging or splitting the
most fitting clusters according to similarities (using either cosine or linkage measure).
It provides a tree called a dendrogram in quadratic time [15]. Hybrid methods exploit
K-means efficiency and hierarchical clustering quality (see Chameleon in [15]). Min-cut
techniques propose to minimize inter-cluster connectivity (the number of links
between clusters) and maximize intra-cluster connectivity [14]. Fuzzy clustering is used
when a document can belong to overlapping clusters. Self-Organizing Map (SOM) is
a neural-network-based clustering technique [11].



Hierarchical Clustered Graph 

A hierarchical clustered graph is defined by a graph G=(V,E) and a rooted tree T [8].
Leaves of T are vertices of G. Each node of T is a set of nodes of G called a cluster. T
is an inclusion tree since it describes an inclusion relation between clusters [6].

Recursively applying a one-level clustering technique (K-means or Min-cut) to
each cluster provides a hierarchical clustered graph. On the other hand, cutting a
dendrogram at different levels of similarity also provides a hierarchical clustered graph.
Multi-level Outline Tree

We proposed in [3] a focus-based multi-level clustering technique that provides a
particular hierarchical clustered graph, called a multi-level outline tree. It is a new
structure that is easily displayed without edge crossings. The next section describes the main
principle underlying this technique and illustrates it with an example.



3 Multilevel Outline Tree - Principles 

We formally presented in [4] a focus-based clustering algorithm that transforms an
undirected connected graph into a new structure called an outline tree. It is a tree whose
meta-nodes are sets of nodes. We extended our algorithm in [3] to provide a multi-level
outline tree. It is an outline tree where each meta-node is itself hierarchically clustered.
We recall the main definitions and illustrate the technique with an example:

- $G = (V, E)$ is an undirected graph with a set of vertices (or nodes) $V = \{v_i, 1 \le i \le N\}$
and a set of edges $E = \{(v_i, v_j), 1 \le i < j \le N\}$. $v_f$ is a specific vertex called the focus.

- $d(v_i)$ is defined as the length of the shortest path between $v_f$ and $v_i$.

- $L_m$ (m-layer) is the set of vertices at distance $m$ from $v_f$: $L_m = \{v_i \in V, d(v_i) = m\}$.

- $G^k_m$ is defined by $G^k_m = \{v_i \in V, m \le d(v_i) \le m+k\} = \{v_i \in L_r, m \le r \le m+k\}$.

- Nodes $v_i$ and $v_j$ on $L_m$ are said to be k-relatives if there is a path between them in $G^k_m$.

- $L_m$ is partitioned into sets of k-relative nodes called k-clusters and denoted $V^k_m$.

- Two clusters on $L_m$ are said to be k-relatives or k-linked if they contain k-relative nodes.

We apply our method on graph G displayed in five layers from focus a (Fig. 1).

[Figure 1 not reproduced: graph G drawn in layers $L_1$ to $L_5$ from focus a.]

Fig. 1. Graph G from focus a
The clustering technique is presented on layer $L_3$. We proceed similarly with the other layers:

- 0-clusters: Nodes h and i are 0-relatives since there is a path from h to i that
belongs to $G^0_3$ (Fig. 1). So they are grouped in a 0-cluster denoted by $V^0_{3,1}$ (Fig. 2).
The other nodes (j, k, l, m) are called singleton 0-clusters.

- 1-clusters: Nodes j and k are 1-relatives since there is a path from j to k that
belongs to $G^1_3$ (Fig. 1). They are grouped in a 1-cluster $V^1_{3,1}$ (Fig. 2).

- 2-clusters: Nodes j, k and l are 2-relatives since there is (at least) one path between
them in $G^2_3$ (Fig. 1). So they are grouped in a 2-cluster $V^2_{3,1}$ (Fig. 2).

The algorithm described in [3] runs in linear time and provides a particular 
clustered graph. We simplify the resulting view by considering only links between 
meta-nodes. We get a tree of meta-nodes called a multi-level outline tree, where each 
meta-node is itself the root of an inclusion tree of clusters (Fig. 3). 
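
The definitions of layers and k-clusters can be illustrated with a minimal Python sketch. It is not the linear-time algorithm of [3]: it simply recomputes one layer by a flood fill, and the function names, the adjacency-dictionary representation and the small example graph are assumptions made for illustration (the graph is not the one of Fig. 1). Here m is the breadth-first distance from the focus.

```python
from collections import deque


def bfs_distances(adj, focus):
    """Return d(v), the length of the shortest path from the focus to each reachable v."""
    dist = {focus: 0}
    queue = deque([focus])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def k_clusters(adj, focus, m, k):
    """Partition layer L_m into k-clusters (nodes joined by a path inside G^k_m)."""
    dist = bfs_distances(adj, focus)
    band = {v for v, d in dist.items() if m <= d <= m + k}  # vertices of G^k_m
    layer = sorted(v for v in band if dist[v] == m)          # L_m
    seen, clusters = set(), []
    for start in layer:
        if start in seen:
            continue
        # Flood fill restricted to the band, then keep only the members on L_m.
        component, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w in band and w not in component:
                    component.add(w)
                    stack.append(w)
        cluster = sorted(v for v in component if dist[v] == m)
        seen.update(cluster)
        clusters.append(cluster)
    return clusters


# Hypothetical example: focus a, layer 1 = {b, c}, layer 2 = {h, i, j, k}, layer 3 = {x}.
example_adj = {
    "a": ["b", "c"],
    "b": ["a", "h", "i"],
    "c": ["a", "j", "k"],
    "h": ["b", "i"],
    "i": ["b", "h"],
    "j": ["c", "x"],
    "k": ["c", "x"],
    "x": ["j", "k"],
}

print(k_clusters(example_adj, "a", 2, 0))  # h and i are linked inside the layer
print(k_clusters(example_adj, "a", 2, 1))  # j and k are linked through x on the next layer
```

In this toy graph, x plays the role of a node on the next layer that connects two otherwise separate nodes of layer 2, which is the situation formalized in the next section.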




Fig. 2. Clustered graph 



4 Cluster Cohesiveness 

4.1 k-Articulation Node - Definition 

Consider clusters $V_m$ and $V'_m$ on $L_m$ that are k-relatives and belong to the k-cluster 
$V^k_{m,n}$. We define $S_k(V_m, V'_m)$ as a minimal set of nodes on $L_{m+k}$ that connect $V_m$ and 
$V'_m$. Nodes in $S_k(V_m, V'_m)$ are called k-articulation nodes between $V_m$ and $V'_m$. They 
are also said to be k-articulation nodes of $V^k_{m,n}$. Indeed, removing them disconnects $V_m$ and 
$V'_m$, and splits up $V^k_{m,n}$. Note that $S_k(V_m, V'_m)$ labels the k-relation between $V_m$ and $V'_m$. 

For instance, $V^2_{2,1}$ and g are 3-relatives and belong to $V^3_{2,1}$ (Fig. 2). $S_3(V^2_{2,1}, g)$ = 
{x, z}. x and z are 3-articulation nodes of $V^3_{2,1}$. Their removal disconnects $V^2_{2,1}$ and g 
and so splits up $V^3_{2,1}$. x and z label the 3-relation (curved arrow) between $V^2_{2,1}$ and g. 

Now, since a k-articulation node is obviously a (k-1)-articulation node, we can 
easily define k-articulation nodes using (k-1)-articulation nodes. 

For instance, z is a 1-articulation node that connects s and t. It is also a 
2-articulation node that connects $V^1_{3,1}$ and l. Moreover, z is a 3-articulation node 
between $V^2_{2,1}$ and g. 

4.2 k-Cluster Cohesiveness 

We define the cohesiveness of a k-cluster using the set of its k-articulation nodes. Indeed, 
removing these nodes splits up the cluster. For instance (see Fig. 2, Fig. 3), the 
cohesiveness of cluster $V^3_{2,1}$ depends on its 3-articulation nodes: x and z. 

We propose to define k-articulation node density. It is an index computed by: 

cohesiveness($V^k_{m,n}$) = 

where N is the size of $V^k_{m,n}$ and A is the number of k-articulation nodes (if k>0) or 
the number of links between nodes (if k=0). For instance, cohesiveness($V^3_{2,1}$) = 2/3, 
cohesiveness($V^2_{2,1}$) = 1/2 and cohesiveness($V^1_{2}$) = 1/1. Note that $V^1_{2,1}$ contains two 
possible 1-articulation nodes h and i. So the shortest path between b and c is longer 
than the shortest path between o and p. We consider that (h-i) is a double articulation 
node. Its weight is 0.5 and consequently cohesiveness($V^1_{2,1}$) = 0.5/1. 

Many cluster validity indices may be applied [2]. The cohesiveness index can be added 
as a visual cue in the multi-level outline tree layout: the more cohesive a cluster is, the 
darker its background (see Fig. 3). 




Fig. 3. Multi-level outline tree 

5 Optimization Issues 

5.1 Changing Focus - Invariant Sets 

We get various multi-level outline trees depending on the focus we take. We propose 
an efficient method to recompute the multi-level outline tree when the user changes focus. 
First of all, we recall two definitions: 

- v is a cut-vertex if G is connected and the graph G - { v } is disconnected. 

- A biconnected component is a maximal subgraph with no cut-vertex. In fact, we 
need to remove at least two vertices to disconnect a biconnected component. 

Removing cut-vertices splits up a connected graph G into biconnected components 
denoted $S_k$ (Fig. 4). Cut-vertices belong to two or more biconnected components. 

Adjacency tree of biconnected components: Biconnected components of an undirected 
connected graph G belong to an adjacency tree T. 

For instance, $S_1$, $S_2$, $S_3$, $S_4$, $S_5$, $S_6$ belong to an adjacency tree T (see Fig. 4). 

Path of biconnected components: Considering foci $v_i$ and $v_j$, we denote by $\Pi(v_i, v_j)$ the 
shortest path of biconnected components in tree T that connects $v_i$ and $v_j$. 

For instance, considering foci a and p, $\Pi(a, p) = \{S_2, S_3\}$ (see Fig. 4). 

Invariant sets: Multi-level outline trees $M(v_i)$ and $M(v_j)$ share sub-outline trees (invariant 
sets) corresponding to biconnected components that do not belong to $\Pi(v_i, v_j)$. 

For instance, M(a) and M(p) share invariant sets $S_1$, $S_4$, $S_5$, $S_6$ (see Fig. 5, Fig. 6). 

Corollary: If $v_i$ and $v_j$ belong to the same biconnected component $S_k$, then all biconnected 
component layouts are the same in $M(v_i)$ and $M(v_j)$ except the $S_k$ layout. 



5.2 Merging Multi-level Outline Tree of Biconnected Components 

We propose an algorithm to improve the computation of the multi-level outline tree: 

- We first compute the adjacency tree T whose nodes are the biconnected components $S_i$. 

- Let $v_j$ be the user focus, which belongs to biconnected component $S_k$. We compute 
the multi-level outline tree of $S_k$ based on $v_j$. 

- For each biconnected component $S_i$ (with $i \neq k$), we consider the closest articulation 
node from $v_j$. We compute the multi-level outline tree of $S_i$ based on this node. 

- Merging the multi-level outline trees of all biconnected components provides a 
global multi-level outline tree of graph G. 

For instance, considering user focus p (see Fig. 6), we merge the multi-level outline 
trees of biconnected components $S_1$, $S_2$, $S_3$, $S_4$, $S_5$, $S_6$ based on nodes a, i, p, h, n, g. 
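
The first three steps of this procedure can be sketched in Python as follows. The decomposition uses a standard depth-first search with an edge stack (in the style of Hopcroft and Tarjan), which is an assumption: the text does not say how the biconnected components are computed. The function names and the example graph are hypothetical, the graph is assumed to be simple and connected, and the recursive search is only suitable for small graphs. Building the outline tree of each component and merging the trees are not shown.

```python
from collections import deque


def bfs_distances(adj, focus):
    """Shortest-path length from the focus to every reachable vertex."""
    dist = {focus: 0}
    queue = deque([focus])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def biconnected_components(adj):
    """Return (list of vertex sets, set of cut-vertices) of a connected simple graph."""
    index, low = {}, {}
    edge_stack, components, cut_vertices = [], [], set()
    counter = [0]

    def visit(u, parent):
        index[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for w in adj[u]:
            if w not in index:
                children += 1
                edge_stack.append((u, w))
                visit(w, u)
                low[u] = min(low[u], low[w])
                if low[w] >= index[u]:
                    # u separates the subtree of w from the rest of the graph.
                    if parent is not None or children > 1:
                        cut_vertices.add(u)
                    component = set()
                    while True:
                        a, b = edge_stack.pop()
                        component.update((a, b))
                        if (a, b) == (u, w):
                            break
                    components.append(component)
            elif w != parent and index[w] < index[u]:
                edge_stack.append((u, w))  # back edge to an ancestor
                low[u] = min(low[u], index[w])

    visit(next(iter(adj)), None)
    return components, cut_vertices


def component_foci(adj, focus):
    """Map each biconnected component to the node its outline tree would be based on."""
    components, cut_vertices = biconnected_components(adj)
    dist = bfs_distances(adj, focus)
    foci = {}
    for comp in components:
        if focus in comp:
            foci[frozenset(comp)] = focus
        else:
            # Closest articulation node of this component, seen from the user focus.
            foci[frozenset(comp)] = min(comp & cut_vertices, key=lambda v: (dist[v], v))
    return foci


# Hypothetical graph: triangles a-b-c and c-d-e sharing c, plus a pendant edge e-f.
example_adj = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d", "e"],
    "d": ["c", "e"], "e": ["c", "d", "f"], "f": ["e"],
}
for comp, node in sorted(component_foci(example_adj, "a").items(), key=lambda kv: sorted(kv[0])):
    print(sorted(comp), "->", node)
```

With focus a, the component containing a is based on a itself, while the two other components are based on the cut-vertices c and e through which they are reached.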

Fig. 4. Graph decomposition in biconnected components - tree T 

Fig. 5. Multi-level outline tree M(a) 




Fig. 6. Multi-level outline tree M(p) 

5.3 Overview Versus Local View 

When the number of documents increases, it may be difficult to visualize information 
such as titles in the overview (see Fig. 8). In this case we can apply a tree layout algorithm 
[13] to expand or collapse a meta-node and its sub-tree. We can also use a filtering 
technique. For instance, we display only clusters with high connectivity. 

Fisheye techniques also provide focus + context views. We presented our main 
visualization and interaction paradigms in a previous work [10]. 



6 Citation Graph Exploration 

We apply our multi-level clustering algorithm to organize and explore citation graphs 
collected on ResearchIndex (Citeseer), a scientific literature digital library [12]. We 
used a robot to explore the ResearchIndex database based on links between documents. 
Note that Citeseer proposes different types of links: “related documents”, “similar 
documents”, “citations” or “co-citations”. Our method is based on links between 
documents whatever their type. Note that we consider only undirected graphs. 

A study of citation graph structure was presented in [1], based on three hundred 
thousand papers collected on ResearchIndex. The authors observed that 90% of the nodes 
form a giant connected component, which in turn contains a biconnected nucleus with 
58% of all nodes. They also found that the connectivity of the citation graph is extremely 
resilient and is not due to the existence of hubs and authorities. 

In the first citation graph described below, 29 nodes (48%) belong to non-trivial 
biconnected components (see dark areas in Fig. 7). Note that we do not get a real 
biconnected nucleus since the graph is too small (only 61 nodes). 




Fig. 7. Graph G - a spring view - invariant sets 

6.1 Exploration from a Focus Paper 

In this section, we propose an automatic organization of articles in the 
k-neighborhood of a focus paper whose title is “Navigation and Interaction in Graphical 
Bookmarks”. In practice we take k = 4. We iteratively explore the most “related 
articles” at distance 1, 2, 3 and 4 from the focus. We then build a citation graph 
with 61 papers. 

In a classical spring layout (Fig. 7) we do not display titles, so as not to overload the 
view. On the other hand, we easily display them in the multi-level outline tree (Fig. 8). 

Additional visual cues have been added to the multi-level outline tree: the more 
connections a node has, the darker its background. Similarly, the more strongly connected 
a k-cluster is, the darker its background (see cohesiveness, Section 4). 

6.2 Interaction 

Interaction is used to display information dynamically. A node’s relations become 
visible when the user’s pointer moves over the node. At the same time, connected nodes are 
also highlighted and a tool-tip displays the entire title of the associated paper. 

The user can change focus by simply clicking on a node. The multi-level outline tree is 
then recomputed. We present (Fig. 9) different multi-level outline trees based on three 
different foci: 1, 9 and 25. Titles are not displayed in order to simplify the view. 

We consider the three largest biconnected components, denoted A, B and C (Fig. 7, 
Fig. 9). Foci 1 and 9 belong to the same biconnected set A. Consequently sets B, C 
and all other sets except A are displayed identically in the resulting multi-level outline trees. 
Now, foci 1 and 25 belong to biconnected components A and C respectively. Moreover, B does 
not belong to the path between A and C. So B is displayed in the same way in the multi-level 
outline trees M(1) and M(25) based on foci 1 and 25. 



6.3 Exploration from a Set of Papers 

Exploring a citation graph from a set of papers (for instance, a search result set or a 
bibliography) may provide a graph with several connected components. The algorithm 
cannot then be applied directly, since a multi-level outline tree is computed from a 
connected graph and a focus node. 

We therefore add an artificial focus node, called the query node, and link it to every 
node in the set of papers. Thus, we get a connected graph and we can build a multi-level 
outline tree based on the query node. 
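
As a minimal sketch of this construction, the Python fragment below adds the artificial node to an adjacency dictionary and checks that the result is connected. The function names, the node label "query" and the three-paper example are assumptions for illustration only.

```python
def add_query_node(adj, papers, query_node="query"):
    """Return a copy of adj with an artificial focus linked to every paper in `papers`."""
    if query_node in adj:
        raise ValueError("query node name collides with an existing node")
    extended = {v: set(neighbours) for v, neighbours in adj.items()}
    extended[query_node] = set()
    for paper in papers:
        extended.setdefault(paper, set()).add(query_node)
        extended[query_node].add(paper)
    return extended


def is_connected(adj):
    """True if every node can be reached from an arbitrary start node."""
    if not adj:
        return True
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(adj)


# Hypothetical result set: p1 and p2 cite each other, p3 is isolated.
papers = {"p1": {"p2"}, "p2": {"p1"}, "p3": set()}
extended = add_query_node(papers, papers.keys())
print(is_connected(papers), is_connected(extended))
```

The query node is at distance 1 from every paper of the initial set, so these papers form the first layer of the resulting outline tree.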

Suppose we are looking for documents about “compound graph”. We compute 
a multi-level outline tree based on the query node “compound graph” for k = 3. 

We recursively collect 3 levels of similar papers (according to Citeseer). We build 
a graph with 221 papers and the links between them. 

We present (Fig. 10) a multi-level outline tree overview without displaying titles. 
At level 2, we observe two main clusters of papers. The other singleton-clusters (at 
level 2) and resulting sub-outline trees may be removed to simplify the view. 

7 Conclusion 



Multi-level outline tree seems well suited to browse and organize citation graphs. Its 
double tree structure provides rich overviews of the graph that are easy to explore. 
User focus is either a specific document or a set of documents resulting from a query. 
In both cases, user focus is the root of an adjacency tree of meta-nodes that user can 
further explore. Each meta-node is itself the root of an inclusion tree of k-clusters. 
This makes multi-level outline trees very useful not only to explore citation graphs 
but also to organize search results or bibliographies from different perspectives. 

In this paper, we introduced k-articulation nodes, which contribute to k-cluster 
cohesiveness. To this end, we extended co-citation and bibliographic coupling by considering 
undirected citation paths between two nodes. We also used graph decomposition into 
biconnected components to optimize the recomputation of the multi-level outline tree when 
changing focus. 

Fig. 9. Different foci - invariant sets $S_i$ 



Abstract. Communication and collaboration with other people is a major theme 
in the information seeking process. Collaborative querying addresses this issue 
by sharing other users’ search experiences to help users formulate appropriate 
queries to a search engine. This paper describes a collaborative querying system 
that helps users with query formulation by finding previously submitted similar 
queries through mining web logs. The system operates by clustering and rec- 
ommending related queries to users using a hybrid query similarity identifica- 
tion approach. The system employs a graph-based approach to visualize the 
query recommendations. 



1 Introduction 

Information seeking is a broad term encompassing the ways individuals articulate 
their information needs, seek, evaluate, select and use information. In the course of a 
search, the individual may interact with people, manual information systems (such as 
libraries) or with digital libraries. A major theme in the various information seeking 
models is that interaction and collaboration with other people is an important part in 
the process of information seeking and use (e.g. [13] [14]). 

Given this idea, collaborative querying aims to assist users in formulating queries 
to meet their information needs by harnessing other users’ expert knowledge or search 
experience [6] [17]. A common approach in collaborative querying is known as query 
clustering, which is to group similar queries automatically without using predeter- 
mined class descriptions. Such queries are typically stored in user logs, which are 
then extracted and clustered to obtain recommended queries to users. A query cluster- 
ing algorithm could provide a list of suggestions by offering, in response to a query Q, 
the other members of the cluster containing Q. In this way, there is an opportunity for 
a user to take advantage of previous queries and use the appropriate ones to meet 
his/her information need. 

Since similarity is fundamental to the definition of a cluster, measures of similarity 
between two queries are essential to the query clustering procedure. We propose a 
hybrid query similarity measure that exploits both the query terms and query results 
URLs. Experiments reveal that using the hybrid approach, more balanced query clus- 
ters can be generated than using other techniques. Further we describe a prototype 
collaborative querying system which exploits the hybrid similarity measure to cluster 
queries and a graph visualization approach to represent the query clusters. The system 
gives users the opportunity to rephrase their queries by suggesting alternate queries. 

The rest of this paper is organized as follows. In Section 2, we review the literature 
related to this work. We then present the design and implementation of the collabora- 
tive querying system. A scenario is given to highlight the usefulness of this system. 
Finally, we discuss the implications of our findings for collaborative querying sys- 
tems and outline areas for further improvement. 



2 Related Work 

There are several useful strands of literature that bear some relevance to this work. 
This section reviews literature from these fields. Firstly, a survey of interactive query 
reformulation is provided as the background for this research. Next, a review of dif- 
ferent query clustering approaches is presented. 



2.1 Interactive Query Reformulation Systems 

With the proliferation of online search engines, more attention has been paid to assisting 
users in formulating accurate queries that express their information needs. A 
number of approaches have been proposed. One approach is to use interactive query 
reformulation systems which aim to detect a user’s “interests” through his/her submit- 
ted queries and give users opportunities to rephrase their queries by suggesting alter- 
nate queries. Several techniques have been used to incorporate aspects of interactive 
query reformulation systems into the information retrieval process. 

One approach to obtain the recommended queries is to use terms extracted from 
the search result documents. Examples include HiB [5], Paraphrase [1] and Altavista 
Prisma [2], which parse the list of result documents and use the most frequently 
occurring terms as recommendations. Some popular commercial search engines, such as 
Altavista [3], Askjeeves [4] and Eurekster [9], incorporate term recommendation 
functions in the hope that they can help users reformulate an accurate query and then 
locate relevant content. 

Another approach is collaborative querying. Related queries (the query clusters) 
may be calculated based on the similarities of the queries in the query logs [12] which 
provide a wealth of information about past search experiences. The system can then 
either recommend the similar queries to users [12] or use them as expansion term 
candidates to the original query to augment the quality of the search results [7]. Here, 
calculating the similarity between different queries and clustering them automatically 
are crucial steps. This will be discussed in the next section. 



2.2 Query Clustering Approaches 

Traditional information retrieval research suggests an approach to query clustering by 
comparing query term vectors (content-based approach). Various similarity functions 
are available including cosine-similarity, Jaccard-similarity, and Dice-similarity [16]. 
Using these functions has provided good results in document clustering due to the 
large number of terms contained in documents. However, the content-based method 
might not be appropriate for query clustering since most queries submitted to search 
engines are quite short [21]. A recent study on a billion-entry set of queries to 
AltaVista has shown that more than 85% of queries contain fewer than three terms and the 
average length of queries is 2.35 [18]. Thus query terms can neither convey much 
information nor help to detect the semantics behind them since the same term might 
represent different semantic meanings, while on the other hand, different terms might 
refer to the same semantic meaning. 

Raghavan and Sever [15] determine similarity between queries by calculating the 
overlap in documents returned by the queries. This is done by converting the query 
result documents into term frequency vectors. The similarity between two queries is 
then decided by comparing these vectors. Fitzpatrick and Dent [10] further developed 
this method by weighting the query results according to their position in the result list. 
They argue that the beginning of a result list is more likely to include a relevant 
document to the original query. Using the corresponding query results is useful in 
boosting the performance of query clustering in terms of precision and recall [10, 15]. 
However this method is time consuming to execute [15]. Glance [12] thus uses the 
overlap of result URLs as the similarity measure instead of the document content. 
Queries are posted to a reference search engine and the similarity between two queries 
is measured using the number of common URLs in the top 50 results list returned 
from the reference search engine. 



3 A Collaborative Querying System 

We have designed a collaborative querying system based on the query clusters gener- 
ated using the hybrid query similarity measure. In our system, the query clusters can 
be explored using a graph visualization scheme. 



3.1 System Architecture 

Figure 1 sketches the architecture of the collaborative querying system. After captur- 
ing a new query, the system will search for matching documents, which is similar in 
function to traditional information retrieval systems. However, beyond the search 
results, the system will identify related queries and use them as recommended queries 
to users. The recommended queries are displayed together with the search results, 
similar to [1, 2, 5]. Users may further explore the recommended queries by 
visualizing the query clusters which contain the initial query and recommended queries. Our 
query graph visualizer is designed to be an independent agent and can be incorporated 
into different information retrieval systems. Put differently, our collaborative query- 
ing system can provide additional information that a user is originally unaware of so 
that the user can use it to formulate a better query to express his/her information 
needs. 

It can be seen from the architecture that there are three essential processes to 
accomplish collaborative querying. The first is the query repository construction 
procedure, which involves query cluster generation. The second process is the query 
recommendation phase, which includes related query detection and query graph 
visualization. The third process is the maintenance of the query repository. 

Fig. 1. Architecture of the collaborative querying system 



3.2 Query Repository Construction 



In this phase, we need to cluster related queries and save the query clusters into the 
query repository. As discussed, our approach to query clustering uses a hybrid method 
based on the analysis of query terms and query results. Here, two queries are similar 
when (1) they contain one or more terms in common (content-based approach); or (2) 
they have results that contain one or more items in common (result-based approach). 
The remainder of this section provides definitions of different query similarity meas- 
ures used in our experiments. Our method of constructing query clusters based on 
different query similarity measures is also presented. 

The content-based approach clusters queries by calculating the overlap of identical 
terms between queries. Taking the term weights into consideration, we can use any of 
the standard similarity measures [16]. Here we only present cosine similarity measure 
since it is most frequently used in information retrieval. 



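$$\mathrm{Sim\_cosine}(Q_i, Q_j) = \frac{\sum_{k=1}^{K} cw_{k,Q_i} \times cw_{k,Q_j}}{\sqrt{\sum_{t \in Q_i} w_{t,Q_i}^{2}} \times \sqrt{\sum_{t \in Q_j} w_{t,Q_j}^{2}}} \qquad (1)$$

where $cw_{k,Q_i}$ refers to the weight of the $k$-th common term between $Q_i$ and $Q_j$ in query $Q_i$, 
$K$ is the number of common terms, and each term weight $w_{t,Q_i}$ is calculated by TFIDF. 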

The results-based approach uses the overlap of result URLs as the similarity 
measure instead of the query content, as shown in formula (2). The results returned by 
search engines usually contain a variety of information such as the title, abstract, 
topic, etc. This information can be used to compare the similarity between queries. In 
our work, taking the cost of processing query results into consideration, we consider 
the query results’ unique identifiers (e.g. URLs) in determining the similarity between 
queries. 



$$\mathrm{Sim\_result}(Q_i, Q_j) = \frac{|R_{ij}|}{\max(|U(Q_i)|,\ |U(Q_j)|)} \qquad (2)$$

where $|U(Q_i)|$ is the number of result URLs for $Q_i$, and $|R_{ij}|$ is the number of 
common result URLs between $Q_i$ and $Q_j$. 

For the content-based approach, a single query term can represent different infor- 
mation needs. For the result URLs-based approach, the same document in the search 
results listings might contain several topics, and thus queries with different semantic 
meanings might lead to the same search results. Thus, we hypothesize that using both 
query terms and the corresponding results may compensate for the drawbacks inher- 
ent in each method. Hence, the hybrid approach is expressed as: 

$$\mathrm{Sim\_hybrid}(Q_i, Q_j) = \alpha \times \mathrm{Sim\_result}(Q_i, Q_j) + \beta \times \mathrm{Sim\_cosine}(Q_i, Q_j) \qquad (3)$$

where $\alpha$ and $\beta$ are parameters assigned to each similarity measure, with $\alpha + \beta = 1$. 
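
The three measures can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' implementation: each query is assumed to be given as a dictionary of TFIDF term weights and a collection of result URLs, the cosine denominator is taken to be the usual product of vector lengths, and the function and parameter names are hypothetical.

```python
import math


def sim_cosine(weights_i, weights_j):
    """Cosine similarity between two queries given as {term: TFIDF weight} dictionaries."""
    common = weights_i.keys() & weights_j.keys()
    dot = sum(weights_i[t] * weights_j[t] for t in common)
    norm_i = math.sqrt(sum(w * w for w in weights_i.values()))
    norm_j = math.sqrt(sum(w * w for w in weights_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)


def sim_result(urls_i, urls_j):
    """Number of common result URLs divided by the size of the larger result list."""
    urls_i, urls_j = set(urls_i), set(urls_j)
    largest = max(len(urls_i), len(urls_j))
    if largest == 0:
        return 0.0
    return len(urls_i & urls_j) / largest


def sim_hybrid(weights_i, weights_j, urls_i, urls_j, alpha=0.25, beta=0.75):
    """Weighted combination of the result-based and content-based measures."""
    if not math.isclose(alpha + beta, 1.0):
        raise ValueError("alpha and beta must sum to 1")
    return alpha * sim_result(urls_i, urls_j) + beta * sim_cosine(weights_i, weights_j)


# Hypothetical queries with unit term weights and made-up result identifiers.
q1_terms, q1_urls = {"data": 1.0, "mining": 1.0}, ["u1", "u2", "u3", "u4"]
q2_terms, q2_urls = {"data": 1.0, "warehousing": 1.0}, ["u3", "u4"]
print(sim_cosine(q1_terms, q2_terms))
print(sim_result(q1_urls, q2_urls))
print(sim_hybrid(q1_terms, q2_terms, q1_urls, q2_urls))
```

The default values of alpha and beta are the pair reported later in this section as giving the best clusters; any other pair summing to 1 can be passed instead.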

Two queries are in one cluster whenever their similarity reaches a certain 
threshold. We construct a query cluster G for each query in the query set using the 
definition in (4). 

$$G(Q_i) = \{Q_j : \mathrm{Sim}(Q_i, Q_j) \geq \text{threshold}\} \qquad (4)$$

where $1 \le j \le n$; $n$ is the total number of queries. 
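
Definition (4) amounts to one pass over all query pairs, which the following sketch makes explicit. The function name `build_clusters` is an assumption, and the word-overlap similarity used in the example is only a stand-in for the measures defined above, chosen to keep the example self-contained.

```python
def build_clusters(queries, sim, threshold):
    """Return {Q_i: G(Q_i)} where G(Q_i) holds every Q_j with sim(Q_i, Q_j) >= threshold."""
    return {
        qi: {qj for qj in queries if sim(qi, qj) >= threshold}
        for qi in queries
    }


def word_overlap(query_a, query_b):
    """Stand-in similarity (Jaccard overlap of words), used only for this example."""
    words_a, words_b = set(query_a.split()), set(query_b.split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


queries = ["data mining", "predictive data mining", "data warehousing"]
clusters = build_clusters(queries, word_overlap, threshold=0.5)
for query, members in clusters.items():
    print(query, "->", sorted(members))
```

Each query belongs to its own cluster because its similarity to itself is 1. Clusters built this way may overlap, since every query gets its own cluster.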

Note that there are alternative clustering algorithms besides the one used in our 
experiments [8]. Compared with these approaches, our method is relatively less time 
consuming. 

In order to test the usefulness of the hybrid query clustering approach, we col- 
lected 20000 queries from the digital library at Nanyang Technological University 
(Singapore). After preprocessing the original queries, including stop word removal, 
misspelled term checking, etc, there were 16000 queries for our experiments. 

We generated different sets of query clusters based on different approaches. 
Computation of the similarity between two queries based on query content (sim_cosine) 
was straightforward using function (1). For sim_result, we posted each query to a 
reference search engine (Google) and retrieved the corresponding result URLs, 
similar to [11]. Since search engines rank highly relevant results higher, we only 
considered the top 10 result URLs returned for each query. The result URLs were then 
used to compute the similarity between queries according to function (2). For the 
hybrid approach (sim_hybrid), the issue was to determine the values for the 
parameters $\alpha$ and $\beta$. We used pairs of $\alpha$ and $\beta$ with the following values respectively: (0.25, 
0.75), (0.5, 0.5) and (0.75, 0.25). Due to space constraints, we only report results for 
$\alpha$ = 0.25 and $\beta$ = 0.75 since this pair of values generates the best quality query clusters. 

Recall that the threshold is the minimum value, obtained from a given similarity 
measure, that determines whether two queries should be clustered into the same 
group. Here, thresholds were set to 0.25, 0.5, 0.7 and 0.9. 

In our experiments, the quality of query clusters is measured using the F-measure 
[21]. The F-measure used here examines the overall quality of query clusters by 
combining precision and recall, with the value varying from 0 to 1. The larger the 
F-measure value, the better the quality of the query clusters. Figure 2 shows the F-measure 
values of the three approaches. 

As the threshold changes from 0.25 to 0.9, the F-measure value of 
sim_hybrid increases from 56% to 77%, that of sim_cosine increases from 49% to 52% and 
that of sim_result decreases from 21% to 20%. We see that sim_hybrid generates the best 
results compared with sim_cosine and sim_result. This confirms our hypothesis that 
a combination of both query terms and result URLs provides better quality query 
clusters than using each separately. More experimental results can be found in [11]. 




Fig. 2. F-measure for different approaches 



3.3 Query Recommendation 

This process involves identifying related queries in the query repository constructed 
in the previous step and visualizing the related queries. 

3.3.1 Detecting Related Queries 

Given a query submitted by a user, we first search the query repository for related 
queries. These queries are then recommended to the user. Here a recursive algorithm 
was implemented to search the query clusters in the query repository, and the initial 
query is regarded as the root node. First, the system detects the query cluster G(Qi) 
containing the initial query Qi. Given the definition of a query cluster (Section 3.2), 
all its members are directly related to Qi. Therefore G(Qi) can be regarded as the first 
level in the graph structure of all queries related to Qi, as shown in Figure 3. Besides 
the query cluster G(Qi), the system further finds the query clusters containing the 
members of G(Qi). For example, since Q1 is a member of the cluster G(Qi), 
the system computes the query cluster G(Q1), which forms the second level as 
shown in Figure 3. This process is iterative and stops at the user-specified maximum 
level to be searched. In our algorithm, the default maximum value is 5, which 
means the algorithm will only detect the top five levels of query clusters related to the 
root node. Thus the final related queries to Qi might go beyond the members within 
G(Qi), giving a range of recommended queries directly or indirectly related to Qi. 
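
The level-by-level expansion can be sketched as follows. The text describes a recursive implementation; this sketch uses an equivalent breadth-first loop, and the function name, the dictionary representation of the repository and the four-query example are assumptions.

```python
def related_queries(clusters, root, max_level=5):
    """Return {query: level} for queries reachable from root within max_level levels.

    `clusters` maps each query Q to its cluster G(Q); the root query is at level 0.
    """
    levels = {root: 0}
    frontier = [root]
    for level in range(1, max_level + 1):
        next_frontier = []
        for query in frontier:
            for member in sorted(clusters.get(query, ())):
                if member not in levels:  # keep the first (shallowest) level found
                    levels[member] = level
                    next_frontier.append(member)
        if not next_frontier:
            break
        frontier = next_frontier
    return levels


# Hypothetical repository: each cluster contains its own query and one neighbour.
repository = {
    "a": {"a", "b"},
    "b": {"b", "c"},
    "c": {"c", "d"},
    "d": {"d"},
}
print(related_queries(repository, "a", max_level=2))
```

With a maximum level of 2, query d is not reached, which mirrors the role of the user-specified maximum level in bounding the set of recommendations.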

Fig. 3. Detecting related queries 



3.3.2 Query Cluster Visualization 

Our system displays query clusters in a graph (Figure 4). The graph edges show the 
relationship between two graph nodes, with the value on the edge indicating the 
strength of the relationship. For example, 0.1 on the edge between the nodes “data 
mining” and “predictive data mining” shows the similarity weight between these two 
nodes is 0.1. In addition, the system offers a control tool bar to manipulate the graph 
visualization area including zooming, rotating and locality zooming. The zooming 
function allows users to shrink or enlarge the graph visualization area. The rotating 
function allows users to view the visualization area from different directions. Finally, 
locality zooming refers to levels of the related queries to be displayed. 

By right clicking on an individual node, a popup menu appears offering a variety 
of options. Firstly, users can use the selected query node and post it to a search engine 
(e.g. digital library at Nanyang Technological University). Recall that the query graph 
visualizer is running as an independent agent and can be incorporated into various 
search engines. Secondly, users may use this query to carry out another round of 
searches across the query repository and detect queries related to the selected one. 
Further, users can expand and collapse each query node on the graph. Note that the 
number beside each node denotes how many of its child nodes have not been expanded 
yet. 

The query graph visualizer was implemented using “Touchgraph”, which is an open 
source component to visualize information in graph formats [20]. 



3.4 Updating Query Repository 

When new queries arrive at the system, there is a need to update the query repository 
so that the system can harness and recommend the latest useful queries. This process 
can be done periodically offline. Figure 5 shows the steps of updating the query re- 
pository. Most of the steps are similar to the query repository construction process 
except the first one - capturing the new queries. This means the newly submitted 
queries will be captured and compared with existing queries and incorporated into the 
query database only if they are unique. For the rest of the steps, refer to Section 3.2. 

Fig. 4. Query graph visualizer 



4 A Scenario of Use 

The following scenario illustrates one of the potential uses of the system and 
highlights the operation of the system. 

Suppose that a user is interested in the field of data mining and he is a novice in 
this area. When he uses the collaborative querying system, the user first submits a 
query “data mining” to search for information. A moment later, a list of queries re- 
lated to “data mining” is displayed as the query recommendations in addition to the 
search results. After looking through the result list and the recommended queries, he 
wants to generate a query graph using “data mining” as the root node. Thus the user 
triggers the query cluster visualizer. A query graph will appear on the visualization 
area (see Figure 4 for an example). While browsing the graph, he is interested in the 
node “knowledge discovery”. It is a new phrase to him but seems related to his search 
topic. Wanting to peruse the queries related to “knowledge discovery”, he zooms in on 
the visualization area by dragging the bar next to the option box from left to right (see 
Figure 6-(a) and 6-(b)). He may also rotate the visualization area to facilitate his 
browsing (see Figure 6-(b) and 6-(d)). By adjusting the locality level, the user 
expands or collapses the nodes that contain child nodes in order to obtain an overview 
of the whole structure of all the queries related to the root node (see Figure 6-(a) 
and 6-(c)). 




Fig. 5. Updating Query Repository 



Now the user notices that there is a number “3” near the node “data warehousing, 
data mining and OLAP” (see Figure 6-(d)). The number here indicates that this node 
has three child nodes which have not been expanded. He right clicks on the node and 
chooses ‘expand this node’ on the popup menu. Note that this action will only affect the 
selected node, while the locality zooming option discussed previously takes effect 
across the whole visualization area. After examining the graph carefully, the user is 
prepared to carry out another round of information retrieval by using the node 
“knowledge discovery”. He thus right clicks on the node and chooses “display result 
in a separate browser”. The query “knowledge discovery” will be posted to the search 
engine automatically and the results will be displayed in a separate browser. He may 
repeat this process until he finds the desired information. 



5 Conclusions and Future Work 

In this paper, we first compared different query similarity measures. Our experiments 
show that by using a hybrid content-based and results-based approach, considering 
both query terms and query result URLs, better query clusters can be generated than 
using either of them alone. We then introduced a collaborative querying system which
utilizes the hybrid query similarity measure to generate query clusters for each query.
We described the design and implementation of a collaborative querying system
based on the query clusters. Our work can contribute to research in collaborative querying
systems that mine query logs to harness the domain knowledge and search experiences
of other information seekers found in them. Firstly, we propose a hybrid
query clustering approach which differs from [10, 12, 15] since none of them uses
the query content itself. Secondly, we employ a graph-based approach to visualize the
recommended queries which differs from [1, 3, 4, 12] since all of them adopt only text
or HTML to display the recommended queries.
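
The hybrid measure summarised above can be illustrated with a minimal sketch. It assumes, purely for illustration, a linear combination of a content-based score (cosine similarity over query terms) and a results-based score (overlap of result URLs). The function names, the weight alpha, and the normalisation are hypothetical and are not taken from the measures compared in this paper.

```python
import math
from collections import Counter
from typing import Iterable


def content_similarity(q1: str, q2: str) -> float:
    """Cosine similarity between bag-of-words term vectors of two queries."""
    v1, v2 = Counter(q1.lower().split()), Counter(q2.lower().split())
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0


def results_similarity(urls1: Iterable[str], urls2: Iterable[str]) -> float:
    """Share of result URLs in common, normalised by the larger result set."""
    s1, s2 = set(urls1), set(urls2)
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / max(len(s1), len(s2))


def hybrid_similarity(q1: str, urls1: Iterable[str],
                      q2: str, urls2: Iterable[str],
                      alpha: float = 0.5) -> float:
    """Assumed linear combination; alpha weights the content-based score."""
    return (alpha * content_similarity(q1, q2)
            + (1 - alpha) * results_similarity(urls1, urls2))
```

Under these assumptions, two queries with no terms in common, such as “data mining” and “knowledge discovery”, can still receive a non-zero score when their result lists overlap.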

Fig. 6. Query cluster visualization



In addition to the initial experiments performed in this research, alternative ap- 
proaches to identifying the similarity between queries will also be attempted. Adap- 
tive elements will be introduced to reflect the growing and changing nature of the 
collection of documents to ensure the quality of query clusters when using the results- 
based approach. In addition, word relationships like hypernyms can be used to replace 
query terms before computing the similarity between queries. Finally, a user evalua- 
tion to test the usefulness and usability of the collaborative querying system will be 
conducted. 

Abstract. The paper presents a comparative analysis of data harvesting
and distributed computing as complementary models of service delivery 
within large-scale federated digital libraries. 

Informed by requirements of flexibility and scalability of federated ser- 
vices, the analysis focuses on the identification and assessment of model 
invariants. In particular, it abstracts over application domains, services, 
and protocol implementations. 

The analytical evidence produced shows that the harvesting model offers 
stronger guarantees of satisfying the identified requirements. In addition, 
it suggests a first characterisation of services based on their suitability 
to either model and thus indicates how they could be integrated in the 
context of a single federated digital library. 



1 Introduction 

As digital libraries grow to accommodate more resources and users, their ar- 
chitectures embrace distribution and, in the process, discover the observables of 
the federation: a widely dispersed and loosely coupled system of cooperating but 
otherwise mutually autonomous parties. 

1.1 Federated Digital Libraries 

Federated digital libraries, or FDLs, are the subject of increasing development 
efforts across the globe: from subject-based and sector-based international ini- 
tiatives - such as the Open Language Archive Community initiative [1] - to 
grand, cross-sectoral, and nationally-scoped initiatives which account for a large 
part of the current development and research efforts within the field including 
the JISC’s Information Environment in the UK [2], the SURF’s Digital Academic
Repository (DARE) in the Netherlands [3], the ARIIC’s Information Infrastructure
in Australia [4], the NSF’s National Digital Library for Science Education
(NSDL) [5,6] and the Networked Computer Science Technical Research
Library (NCSTRL) [7] in the US, and the Deutsche Initiative für Netzwerkinformation
(DINI) [9] in Germany.

Admittedly, distribution is not a necessary implication of scope, and large-scale
resource sharing may still rely on a centralised design. This is, for example,
the approach adopted by the learning object community in the UK for the in-progress
development of the nation-wide JORUM repository [10]. Exceptions to
the federated approach, however, are best interpreted as interim and exploratory 
solutions intended to mitigate the challenges of interoperability whilst fostering 
the formation of large communities of users. It is then anticipated that the cost 
of adequately serving such communities requires the organisational and technical 
support of a distributed infrastructure of local administrative domains. 

In the absence of centralised content, the identity and raison d’être of an FDL
lie exclusively in its service provision layer. It is through their services that FDLs 
hope to improve over the ubiquitously deployed and extremely popular services 
of another, globally distributed, and yet largely unmanaged federation, namely 
the World Wide Web. The goal is clear: by reflecting the needs and leveraging the 
means of comparatively smaller and more cohesive communities, FDLs set out 
to challenge the scope and accuracy of existing Web services, primarily search 
engines. The strategy is also clear: to build federated services against structured 
descriptions of resources, that is, metadata, rather than the resources themselves.
The underlying assumption - to date unqualified and largely untested - is that 
a structured approach will fare better than content-based or link-based analy- 
sis. Given the predominant implementation strategy, it is indeed suggestive to 
think of FDLs as ‘mini-webs’, more focused, homogeneous, and thus potentially 
functional subsets of the HTTP-based Web on top of which they are conceptually
and technically layered. 

Service provision is also where FDLs meet most directly the challenges of 
interoperability. A federated service faces the heterogeneity of tools, policies, 
means, and, to a large extent, purpose, which derives from the foundational assumption of
autonomy across participating parties. From a technical perspective, it must be 
able to accommodate significant variations in metadata syntax, semantics and 
exchange protocols. From an organisational perspective, it must also account 
for often dramatic variations in resource allocation, technical know-how, and 
local and community-wide agendas. Further, a federated service is expected to 
meet the qualitative requirements which its users normally associate with the 
provision of Web services, and to do so as the parties, resources, and users in 
the FDL scale up towards largely unknown bounds. 

1.2 Distributed Computing and Data Harvesting 

Informed by the core requirements of flexibility and scalability, this paper looks 
into technical models for the provision of federated services. To limit an otherwise 
prohibitive scope, it ignores issues related to metadata quality and metadata 
semantics and focuses instead on models of service delivery in the presence of 
distribution. 

In its most generic form, the problem of service delivery is one of computing
over widely distributed data and, as such, it admits either one of two complementary
solutions. In distributed computing, the computation (i.e. the service)
is distributed along with the remote data (i.e. the resource metadata), while in
data harvesting the data is first gathered and then computed over locally.
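
The contrast between the two models can be made concrete with a minimal in-memory sketch. The Party class, its search and disclose methods, and the substring matching are hypothetical simplifications introduced for illustration; they do not correspond to any of the protocols discussed in this paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Party:
    """A federated party holding metadata records (here, plain titles)."""
    name: str
    records: List[str]

    def search(self, term: str) -> List[str]:
        # Distributed computing: the party itself executes the query.
        return [r for r in self.records if term.lower() in r.lower()]

    def disclose(self) -> List[str]:
        # Data harvesting: the party merely exposes the metadata it holds.
        return list(self.records)


def distributed_search(parties: List[Party], term: str) -> List[str]:
    """The computation travels to the data at request time."""
    hits: List[str] = []
    for party in parties:
        hits.extend(party.search(term))
    return hits


class HarvestingService:
    """The data is gathered first and then computed over locally."""

    def __init__(self) -> None:
        self.index: List[str] = []

    def harvest(self, parties: List[Party]) -> None:
        # Off-line step: no end-user is waiting while this runs.
        self.index = [r for p in parties for r in p.disclose()]

    def search(self, term: str) -> List[str]:
        # Request-time step: purely local, no party is contacted.
        return [r for r in self.index if term.lower() in r.lower()]
```

In this toy setting both services return the same hits; what differs is who does the work and when it is done.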

Until recently, the distributed computing model has received most of the 
theoretical and practical attention, both within and outside the field. Its use for 
resource discovery, in particular, has been standardised and widely tested within 
the library community through, respectively, specifications and implementations 
of the Z39.50 protocol [11]. More modern, lightweight, and web-oriented interpretations
of the model - most notably the SDLIP [14], SRW/SRU [13], and
SQI [15] protocols - are also becoming increasingly popular.

At least in principle, the harvesting model is also familiar within the field. 
In diverse, domain-specific, and often implicit guises, it can be recognised as the 
approach underlying many physical union catalogues and all web-based search 
engines. First proposed and indeed named in the context of scalable architectures
for Web-wide search services [21], harvesting can now count on an application-independent
specification which has become the de facto standard for a rapidly
increasing number of implementations, namely the OAI-PMH protocol of the
Open Archives Initiative [16].

While both models are well represented in the field, early experimental ev- 
idence (e.g. [19], [20]) suggests that the harvesting model offers stronger guar- 
antees to meet the service requirements of flexibility and scalability. The FDL 
initiatives mentioned in Section 1.1 vary substantially in terms of scope, architec- 
tural detail, and ultimately design philosophy; nonetheless, they have all chosen 
harvesting as the preferred model for the delivery of their services. One, the NCSTRL
initiative, has recently undergone a phase of redeployment to replace its
mechanisms for distributed computing with mechanisms for data harvesting [8].

1.3 Motivations and Outline 

In the light of such extensive support, it is perhaps surprising that a high- 
level, comprehensive, and principled case for metadata harvesting within FDLs 
has not yet found, to the best of the author’s knowledge, a dedicated place in 
the literature. Granted, terse references to the ‘simplicity’ and sometimes ‘ef- 
ficiency’ of metadata harvesting are nearly ubiquitous in publications related 
to the OAI-PMH protocol (e.g. [17], [18]). Similarly, complementary problems of 
‘complexity’, ‘poor performance’, and ‘limited interoperability’ of Z39.50 have 
been repeatedly flagged some time before the advent of harvesting, most notably
in relation to virtual and physical union catalogues (e.g. [22], [23]). Finally,
some design considerations on the applicability of the two models, again pre- 
dating the OAI specifications, may be discovered in service-specific consultancy 
reports (e.g. [24]). 

Partly, the goal of this paper is to collect, expand, contextualise, and in- 
stantiate the arguments that have been produced so far. Even when the sparse 
analytical evidence is collated, however, it is unclear whether the identified prop- 
erties are accidents of specific protocol implementations and services, or whether 
they can be considered as invariants of the models underlying those protocols. 
Consequently, it remains difficult to characterise the application domains and
services which suit one model rather than the other and thus to support decision
makers in their choice of service delivery protocols.

In an attempt to fill this gap, the paper presents a comparative analysis of the 
two models which is independent of the application domains in which they are 
used, the services which adopt them, and the protocols which implement them. 
In particular, the paper seeks answers to questions like ‘if the OAI-PMH is to be 
preferred over Z39.50 for a given federated service, is it also more indicated than 
SRW/SRU for the same service?’ and ‘what services are better accommodated
by the OAI-PMH and which ones suit instead a Z39.50-based or an SRW/SRU-based
approach?’, and again ‘how can the two models coexist in the context of
a single FDL?’.

Presented in Section 2, the analysis is carried out in five steps. Section 2.1 
contextualises the general requirement of flexibility to the case of service deliv- 
ery and shows how it can be approximated by the simplicity of delivery models. 
Section 2.2 discusses the degrees of complexity of the two models under examination,
Section 2.3 illustrates the manifestations of such complexity within an
FDL, and Section 2.4 outlines the potential for scalability associated with the 
models. Section 2.5 considers their limitations in terms of functionality and how 
these limitations may inform a characterisation of services in relation to their 
suitability to either model. Finally, Section 3 draws some conclusions and relates 
service delivery models to other aspects of interoperability. 

2 Analysis 

One way of capturing the complementary nature of data harvesting and dis- 
tributed computing is by noticing that while the former localises service provi- 
sion within the FDL, the latter spreads it across all the federated parties. This 
Section shows how this simple observation bears profound consequences in terms 
of both flexibility and scalability of federated services. 

2.1 Flexibility as Simplicity 

In any deployment scenario, the technical and organisational costs associated 
with the complexity of a given solution - whether a metadata model, a ser- 
vice delivery model, or a service delivery protocol - must be carefully measured 
against the gain in functionality that justifies them and the heterogeneity of the
community that must absorb them [28]. The price of misjudgements is a partitioning
of the community intended for that solution.

In principle, any given degree of complexity identifies a sub-community of the 
initially intended one, potentially excluding: (i) members who cannot sustain the 
solution or do not want to in response to functionality deemed unnecessary (the
solution is too complex), or (ii) members who desired and could have sustained
a higher degree of functionality (the solution is too simple). If the community of
adoption does not have or assume significance with respect to the one initially 
targeted, the solution fails and tends to be progressively abandoned. At best, 
the solution is re-purposed within a narrower scope, and the problem for which
it was originally conceived remains an open one. This is, for example, the case
of the Dublin Core metadata model, which was originally intended for resource
description over the unmanaged Web and is now re-purposed within more
disciplined FDLs.

Undoubtedly, the diversity of organisational structures remains a primary 
observable within FDLs and thus the simplicity of solutions intended for FDLs 
is to be treasured above the functionality they can offer. When it comes to service 
delivery, in particular, the simplicity of a model translates into a measure of its 
flexibility. Simply put, a flexible model for the delivery of federated services should
present a ‘low barrier’ to the interoperability of federated parties.

2.2 The Causes of Complexity 

Notice now that distributed computing requires that each federated party participate
in the implementation, deployment, and maintenance of all the federated
services its metadata contributes to. In contrast, harvesting requires only that
federated parties be able to disclose the metadata they hold, a task which is in
general much simpler than service provision and, most importantly, one which
offers more resilience across different federated services.

Consider, for example, a federated service for resource discovery. In a dis- 
tributed computing interpretation of the service, federated parties must be able, 
at the very least, to parse, translate, and execute all the queries submitted to 
the service by its users. In addition, the service requires that the parties return 
query results in a format the service is willing to accept, and thus that parties 
be potentially engaged in data transformation tasks. Depending on the service 
functionality, the service may also require that parties perform additional func- 
tions, such as management of the result set (e.g. filtering, ordering, browsing, 
providing statistics, etc). 

The demands that parties must satisfy are different under a harvesting interpretation
of the same service. Besides the potential data transformation tasks which
are necessary in any data exchange scenario, federated parties are required at 
most to recognise and execute a small and fixed number of simple queries to 
scope the disclosure of their metadata. In particular, they are not expected to 
parse and interpret the expression of a full-fledged and potentially complex query 
language. 
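
To give a feel for how small the disclosure interface can be, the following sketch shows a party whose only ‘query’ is a date window over its records. It is loosely inspired by the date-scoped selective harvesting found in OAI-PMH but does not implement that protocol; the record layout, the disclose function, and its since/until parameters are assumptions made for illustration.

```python
from datetime import date
from typing import Dict, List, Optional

Record = Dict[str, object]

# Hypothetical metadata held by one federated party.
RECORDS: List[Record] = [
    {"id": "r1", "title": "Data mining primer", "modified": date(2004, 1, 10)},
    {"id": "r2", "title": "Union catalogues", "modified": date(2004, 3, 5)},
    {"id": "r3", "title": "Metadata profiles", "modified": date(2004, 6, 20)},
]


def disclose(records: List[Record],
             since: Optional[date] = None,
             until: Optional[date] = None) -> List[Record]:
    """Return the records modified within the (inclusive) date window.

    This fixed, parameterised selection is all the party has to support:
    there is no query language to parse, translate, or execute.
    """
    return [
        r for r in records
        if (since is None or r["modified"] >= since)
        and (until is None or r["modified"] <= until)
    ]
```

A harvester that last visited on 1 March 2004 would call disclose(RECORDS, since=date(2004, 3, 1)) and receive only the two records changed since then. Any richer querying happens later, at the service provider.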

The simplicity of disclosure over full service provision should not be considered
in the limited context of a single service, as is normally done. Rather, it
should be viewed under the common assumption that federated parties will contribute
to more than one service within the FDL, where different services may:
(i) offer different functions (e.g. resource discovery, citation linking, metadata
enhancement, current awareness, etc.), (ii) specialise similar functions to the
needs of different sub-communities within a single FDL (e.g. cross-community
resource discovery versus learning object or eprints discovery), or (iii) simply
compete on the basis of additional value-added services (e.g. user interfaces,
service customisation, etc.).

In a ‘multi-service’ scenario, the additional complexity of the distributed 
computing approach leaves more room for variations across services and thus
places higher costs on the ‘mobility’ of federated parties across different services.
When moving across different resource discovery services, for example, a feder- 
ated party may need to process different query languages and perform different 
result management functions as well as carry out different data transformation 
tasks. In contrast, only the costs associated with the latter may be faced by 
a federated party which simply discloses its metadata. For example, a party 
that discloses simple Dublin Core metadata for resource discovery will face no 
additional costs when ‘moving’ to another DC-based discovery service and in 
fact to any other service which relies on the same metadata format. Even when 
the party does have to translate its own metadata into other formats than DC 
(e.g. IEEE LOM), the availability of an FDL-wide syntactic interoperability solution
- normally one based on the XML standard - implies that the costs are
incremental rather than ex novo. 

2.3 The Costs of Complexity 

Once the complexity of the distributed model has been ascertained, one may 
consider the effects of that complexity within the FDL. Obviously, complexity 
raises implementation costs and thus tends to limit the number of available im- 
plementations to those produced by resourceful parties and commercial vendors. 
Even when free implementations are made available, the tight coupling between 
the functions of any delivery model and the metadata back-end of individual 
parties makes off-the-shelf reuse an elusive goal and does not eliminate the need 
of installation, customisation, and maintenance tasks. 

Another way in which complexity undermines interoperability is by increasing 
the possibility of incomplete or erroneous specifications whilst reducing their 
understandability. In particular, complex protocol specifications are prone to 
unstable releases, problems of backward compatibility, and mutually inconsistent 
implementations.

Most importantly, complexity almost invariably amplifies problems of semantic
interoperability within the model [25]. Full service provision, in particular,
multiplies the requirements of semantic alignment between federated parties and
thus is more prone to breaking interoperability through inconsistent implementations
of the model. For example, the lack of interoperability between Z39.50
targets caused by differences in mappings of search attributes onto local database
indexes, extraction and normalisation algorithms for search keys, and stopword
handling is well documented in the literature (e.g. [22]).

To avoid the problem, services may make a degenerate, almost ‘syntactic’ 
use of the model [12], which is suitable only for high-level meta-services not 
oriented to the end-user (e.g. server implementation browsing). Alternatively, 
they may restrict their scope to all the federated parties which comply with 
some community-specific instantiation of the model. Instantiations may concern 
the query language, the format of the metadata, or the support for optional 
functionality, and may be approached in a number of ways, including profiling 
[29] and MOP-based expansion and refinement [14]. Normal practice is then to
mandate support for a minimal instantiation to support the implementation of
federated services against the greatest common denominator of the implementations
deployed at the federated parties (e.g. [26], [33], [27]).

Clearly, the harvesting model is not immune to the interference of semantics
with service delivery and thus does not obviate the need for a ‘spectrum of interoperability’
solutions within the FDL [6]. By limiting such interference to a
profiling of metadata formats, however, harvesting simplifies the organisational
aspects of the profiling process whilst maximising the scope of the community
which adopts the profile within the FDL.

2.4 Scalability 

Section 2.2 and Section 2.3 have shown that an approach to interoperability 
based on the harvesting model promises to contain service deployment costs 
within FDLs. The model, however, is also beneficial for service implementation, 
for it delivers all the good properties which are normally associated with local 
computations. 

With harvesting, in particular, the diverse capabilities of federated parties
and the observables of the network may be factored out of real-time interactions
with the end-users and be faced instead off-line, possibly through flexibly configurable
processes [16]. Latency-inducing factors associated with slow, congested,
or simply unavailable connections have virtually no impact on the reliability and
responsiveness with which a service interfaces with its users.

In contrast, a service distributed across the FDL is intimately dependent on 
the federated parties and the underlying network, and thus tends to be con- 
strained by the performance of the ‘weakest’ party and the fluctuations of the 
available bandwidth. The fact that parties and network are in principle required 
to sustain the full service load (e.g. all the user queries submitted to a discovery
service) cannot but worsen the situation. Experimental evidence indicates
that the performance of basic implementations of distributed discovery services
tends to decrease rapidly as the number of participating parties grows beyond
10-15 [30].
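
The dependency on the ‘weakest’ party can be expressed as a back-of-the-envelope model. The sketch below assumes a service that fans each query out to all parties in parallel and waits for every answer (or for a timeout); the function names and all figures used with them are illustrative assumptions, not measurements from [30].

```python
from typing import Optional, Sequence


def distributed_response_time(party_latencies: Sequence[float],
                              timeout: Optional[float] = None) -> float:
    """Time (in seconds) before a complete merged answer is available.

    With a parallel fan-out the service waits for the slowest party,
    unless it gives up at the timeout and returns a partial answer.
    """
    slowest = max(party_latencies)
    return slowest if timeout is None else min(slowest, timeout)


def remote_requests_distributed(num_queries: int, num_parties: int) -> int:
    """Every user query is forwarded to every federated party."""
    return num_queries * num_parties


def remote_requests_harvesting(num_harvest_rounds: int, num_parties: int) -> int:
    """Parties are contacted once per harvesting round, whatever the query load."""
    return num_harvest_rounds * num_parties
```

For instance, with assumed latencies of 0.2, 0.3, and 4.0 seconds the merged answer takes 4.0 seconds, however fast the other parties are. Under the same assumptions, 10,000 daily queries over 15 parties generate 150,000 remote requests, against 15 for a single daily harvesting round.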

Admittedly, manual or automated clustering techniques [31], proxy-based 
solutions [14], and replication strategies [7] may help to more equally distribute 
the service load across the FDL. However, the pragmatic and intellectual costs 
of scaling these approaches against the number and capabilities of participating 
parties are largely unclear but promise to raise significantly the overall costs of 
the FDL infrastructure. Significantly, advanced implementations of distributed 
services have been so far confined to the prototypal domain. 

If reliability is largely related to distribution, performance may also be influenced
by network-independent requirements. With distributed computing, the
costs of pre-processing the metadata received from participating parties before 
presenting it to the users must also be accommodated in real-time. The result- 
ing penalties discourage or severely limit the possibilities of metadata trans- 
lation, de-duplication, versioning, and enhancement which are so important in 
the diverse environment of the FDL. It is hard, for example, to imagine distributed
services with consolidation capabilities which go beyond straightforward
identifier-based de-duplication [22]. Again, the harvesting model allows these
inherently difficult and computationally intensive processes to be hidden away from the
users and, by doing so, paves the way for a family of middleware services which
remain instead elusive in the distributed computing scenario.

Of course, the harvesting model raises its own scalability issues. A federated 
service based on harvesting operates on a centralised copy of the remotely dis- 
tributed metadata and thus may rapidly become large in response to the number 
and growth of participating collections. However, the costs associated with local 
scalability are relatively lower when compared with those raised by network- 
based solutions. Equally important, the technical processes required for local 
scalability are well understood and require opportunistic intervention on vari- 
ables which are entirely under the control of service implementors (e.g. memory, 
disks, processors, local networks) [22].

Clearly, more experience is needed to identify the limits of the harvesting 
approach beyond the positive results of early experimental services [19]. How- 
ever, it may be argued that no realistic degree of scalability can be predicated 
on soaring costs. In this sense, the very existence of comprehensive and long- 
established physical union catalogues (e.g. [24], [32]) and Web search engines 
suggests that, whatever may be the precise limits of harvesting, these may be 
approached at relatively contained costs. 

2.5 Functionality 

In the light of the principles presented in Section 2.1 and the advantages at- 
tributed to harvesting over distributed computing in Section 2.2 and Section 2.3, 
it is interesting to observe that neither model enables more functionality than 
the other within the FDL. 

At first, this statement may appear controversial for - by exposing the functionality
of specific services - even the most streamlined server-side implementations
of the distributed computing model (e.g. SRW/SRU) are more expressive
than any server-side implementation of the harvesting model. Indeed, the ad- 
vantages associated with the simplicity of harvesting are ultimately predicated 
on this argument. However, from a broader, service-oriented perspective - and 
thus from a client-side perspective - the situation is quite different. 

In a strict computational sense, the harvesting model enables more expressive 
federated services than are possible under the distributed computing model. 
The reasons for this are largely those discussed in Section 2.4, and relate to 
the limited possibilities of metadata pre-processing which are allowed under the 
distributed computing model. Not only does this apply to processes which are 
theoretically possible but pragmatically unfeasible under that model (e.g. format 
translation). It also applies to processes - such as advanced consolidation and 
standard ranking algorithms [22] - which require the totality of the remotely
distributed metadata and cannot rely solely on the responses of participating 
parties to individual service transactions (e.g. queries). Put another way, not 
all computations can be distributed across the disjoint union of participating 
metadata collections. 
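
A small worked example may help to see why some computations need the totality of the metadata. Inverse document frequency, a common ingredient of ranking functions, is used here only as an illustration; the idf function and the two toy collections are assumptions and are not drawn from [22].

```python
import math
from typing import List


def idf(term: str, collection: List[str]) -> float:
    """Inverse document frequency of a term within one collection of titles."""
    df = sum(1 for doc in collection if term in doc.lower().split())
    return math.log(len(collection) / df) if df else 0.0


# Hypothetical metadata held by two federated parties.
party_a = ["Data mining", "Data warehousing"]
party_b = ["Digital libraries", "Metadata harvesting",
           "Federated search", "Data harvesting"]

local_a = idf("data", party_a)              # 0.0: the term is in every record
local_b = idf("data", party_b)              # log(4): the term is rare here
global_ = idf("data", party_a + party_b)    # log(2): needs the union
```

Each party, seeing only its own records, weights the term differently, so scores returned by separate parties are not directly comparable. The global weight can only be computed by whoever holds, or has harvested, the union of the collections.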

The harvesting model, on the other hand, relies on a mono-directional information
flow from data providers to service providers and is thus bound to the
subclass of service architectures which can be gracefully accommodated within 
this assumption. Whenever the intended functionality requires information to 
flow in the opposite direction or in both directions - and thus relies on a different
distribution of roles between communicating parties - harvesting loses
much of its appeal.

One case which defeats the harvesting approach is when data exhibits an
extremely dynamic nature. As an example in the classic library domain, consider
the needs of union catalogues which wish to offer circulation data along with
bibliographic data. Here, harvesting is not an effective solution, for the harvesting
rates required by the dynamicity of circulation data would prove so intensive as to
essentially reintroduce the network as a real-time observable of service provision.
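
The effect of dynamicity on harvesting rates can be sketched with simple arithmetic. The model below assumes that each party must be re-harvested at least once within the maximum staleness the service can tolerate; the function names and the staleness figures are illustrative assumptions.

```python
SECONDS_PER_DAY = 86400


def harvests_per_day(max_staleness_seconds: float) -> float:
    """Harvesting rounds per party per day needed to bound data staleness."""
    return SECONDS_PER_DAY / max_staleness_seconds


def harvest_requests_per_day(num_parties: int,
                             max_staleness_seconds: float) -> float:
    """Total daily harvesting requests across all federated parties."""
    return num_parties * harvests_per_day(max_staleness_seconds)
```

If bibliographic data may be a day old, one round per party per day suffices. If circulation data may be at most a minute old, each party must be harvested 1,440 times a day, that is 21,600 daily requests for 15 parties, at which point the network is once again a real-time concern.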

Similarly, harvesting has little to offer for the implementation of a local in- 
terface to a remote service (e.g. a local Z39.50 interface to an existing discovery 
service), even if the latter had facilities in place to offer its data for third-party 
harvesting. Here, the local interface is best viewed as an extension of the remote 
service and no clear distinction between data and service provision can be made. 
In particular, local harvesting of the remote data would simply reintroduce de- 
ployment costs which have been already absorbed within the FDL. In contrast, a 
two-party dialog is an ideal and indeed prototypical application scenario for the 
distributed computing model [12] and one in which the problems of inter-party 
interoperability discussed in Section 2.2 simply do not arise. 

There are services, accordingly, which are - in any practical sense - out- 
side the scope of the harvesting model and yet may play an important role 
within the FDL. It should be noted, however, that such services rely on strong 
agreements between communicating parties which can only be expected within 
tightly-coupled subsets of the FDL. Put another way, these services operate 
within the FDL but do not belong to the category of truly federated services. 

3 Conclusions 

Data harvesting and distributed computing may both serve as models of service 
delivery in the context of large-scale federated digital libraries. 

Harvesting clearly separates the concerns and responsibilities of data pro- 
viders from those of service providers, while distributed computing views data 
provision and service provision as inherently overlapping processes. In partic- 
ular, harvesting induces a 2-phase view of service delivery which distinguishes 
the aspects related to communication - which involve both service and data 
providers - from those that relate to service-specific implementation - which in- 
stead concern only service providers. In contrast, distributed computing collapses 
communication and service-specific implementation within a single protocol of 
interaction. 

That communication between service and data providers may take place in 
conceptual isolation from service-specific implementation is beneficial to data
providers, for it shifts the costs of their participation to where they are expected
to be affordable, namely the service providers. Vice versa, service implementation
benefits from abstracting over communication, for it can deliver all the good 
properties normally associated with local and off-line computations. 

For these reasons, the harvesting model offers stronger guarantees to meet 
requirements of flexibility and scalability of federated services. In contrast, the 
distributed computing model offers complementary support for services that 
operate within more cohesive subsets of the federated library. 

To conclude, it is worth noting that the harvesting model offers little help
with semantic issues of metadata interoperability: successfully exchanged meta- 
data must still be uniformly understood. In particular, the model alone cannot 
guarantee a uniform implementation of federated services against metadata mod- 
elled according to different models, formats, profiles, and standards. The model 
abstracts over the complexity of the metadata which may be harvested within 
sub-communities of the federated library and thus reflects a bipartite conceptual 
model which helps to more clearly separate, and thus tackle, different pieces of 
the interoperability jigsaw. 

Abstract. With the continued growth of the Open Archives Initiative Protocol 
for Metadata Harvesting (OAI-PMH) [1], it has become increasingly difficult 
for OAI service providers to discover new and keep up-to-date with existing 
data providers. There are currently several registries of OAI data providers. 
Most of these registries are incomplete. Most contain minimal information 
about registered providers - typically a base URL and little if anything else - 
providing service providers no clue as to repository scope, content, or size. 
These deficiencies mean significant extra overhead for service providers. This 
paper describes a more comprehensive registry of OAI data providers (available 
at http://oai.grainger.uiuc.edu/registry), developed to address some of these is- 
sues. While our registry as it presently exists facilitates discovery of data pro- 
viders, utility is limited by lack of consistent practice for collection-level meta- 
data. To realize the full potential of a better registry, the OAI community needs 
to develop better practices for collection-level description. 



1 Why Another OAI Registry? 

We developed our own OAI metadata provider registry to better support a range of 
OAI-based projects at the University of Illinois Libraries. These projects have in- 
cluded the Mellon-funded UIUC Digital Gateway to Cultural Heritage Materials 1 , the 
Grainger Engineering Library’s OAI Search Portal for Engineering, Computer Sci- 
ence, and Physics 2 , the IMLS Digital Collections and Content project 3 , the NSDL 
Second Generation Digital Mathematic Resources project 4 , and most recently the 
CIC-OAI Metadata Harvesting Service project 5 . Especially for those projects which 
are building focused OAI-based services, it has become clear that a significant amount 
of effort is required to discover relevant data providers and/or relevant sets within a 
single data provider. This has usually involved manually browsing data providers 
whose base URLs are listed in one of several existing registries such as that main- 
tained at the Open Archives Initiative web site 6 or at the OAI Repository Explorer 



1 http://oai.grainger.uiuc.edu/ 

2 http://g118.grainger.uiuc.edu/engroai/ 

3 http://imlsdcc.grainger.uiuc.edu/ 

4 http://nsdl.grainger.uiuc.edu/ 

5 http://cicharvest.grainger.uiuc.edu/ 

6 http://www.openarchives.org/Register/BrowseSites.pl 